Dhall-haskell not conforming to the standard with respect to disallowed unicode characters?

SiriusStarr · October 21, 2019, 2:52am

Merely for my own edification, but hopefully someone can shed some light on it.

The ABNF in the language standard makes it pretty clear that certain unicode escape sequences are not valid:

; The parser must also reject Unicode escape sequences that are either:
;
; * Surrogate pairs (i.e. `%xD800-DFFF`)
; * Non-characters (i.e. `%xNFFFE-xNFFFF` for each `N` in `{ 0 .. F }`)

However, dhall seems perfectly content to accept text that violates these rules, e.g.

$ dhall <<< '"\uD800"'
"�"
$ dhall <<< '"\uFFFE"'
""

(Note that there is a character not rendering in the second example above.)

Additionally, it seems to reject sequences that are supposedly accepted per the ABNF, e.g.

$ dhall <<< '"\u{1FFF0}"'
dhall: 
Error: Invalid input

(stdin):1:10:
  |
1 | "\u{1FFF0}"
  |          ^
Invalid Unicode code point

Additionally, surrounding the aforementioned cases with curly braces causes them to be rejected…

$ dhall <<< '"\u{D800}"'
dhall: 
Error: Invalid input

(stdin):1:9:
  |
1 | "\u{D800}"
  |         ^
Invalid Unicode code point

$ dhall <<< '"\u{FFFE}"'
dhall: 
Error: Invalid input

(stdin):1:9:
  |
1 | "\u{FFFE}"
  |         ^
Invalid Unicode code point

Just trying to understand what is going on here.

Gabriel439 · October 21, 2019, 2:52pm

@SiriusStarr: So the Haskell implementation accepting \uD800/\uFFFE stems from a bug in the standard. The standard only forbids invalid code points for braced escape sequences:

dhall-lang/dhall-lang/blob/8098184d17c3aecc82674a7b874077a7641be05a/standard/dhall.abnf#L256-L262


; The parser must also reject Unicode escape sequences that are either:
;
; * Surrogate pairs (i.e. `%xD800-DFFF`)
; * Non-characters (i.e. `%xNFFFE-xNFFFF` for each `N` in `{ 0 .. F }`)
;
; See the `valid-non-ascii` rule for the exact ranges that are not allowed
unicode-escape = 4HEXDIG / "{" 1*HEXDIG "}"

… but it should also be forbidding them for non-braced escape sequences.

The Haskell implementation rejecting \u{1FFF0} is a bug in the Haskell implementation, which uses the following check:

dhall-lang/dhall-haskell/blob/ad443cd6851af215111e21d8ac92a9fb3cda7403/dhall/src/Dhall/Parser/Token.hs#L137-L142


-- | Returns `True` if the given `Char` is a valid Unicode codepoint
validCodepoint :: Char -> Bool
validCodepoint c =
    not (category == Char.Surrogate || category == Char.NotAssigned)
  where
    category = Char.generalCategory c

Data.Char does not contain a Unicode general category that matches only non-characters, so I must have used Char.NotAssigned as an approximation. That category also covers code points that are merely unassigned, such as U+1FFF0, which is what leads to that issue.
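
The overlap can be inspected directly. This is a minimal sketch, not code from dhall-haskell: `describe` and `examples` are hypothetical names, and the category reported for U+1FFF0 depends on the Unicode tables shipped with the `base` library in use.

```haskell
import Data.Char (chr, generalCategory)
import Numeric (showHex)

-- | Hypothetical helper: render a code point together with the general
-- category that Data.Char reports for it.
describe :: Int -> String
describe n = "U+" ++ showHex n "" ++ ": " ++ show (generalCategory (chr n))

-- | A surrogate, a non-character, and the code point from the thread.
examples :: [String]
examples = map describe [0xD800, 0xFFFE, 0x1FFF0]
```

Running `mapM_ putStrLn examples` in GHCi shows which category each code point falls into. Non-characters such as U+FFFE are reported as `NotAssigned`, but so is any code point that has no character assigned yet, so the category alone cannot tell the two apart.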

SiriusStarr · October 21, 2019, 5:55pm

Ahh, I see. So the “correct” behavior for an implementation is to always reject anything in D800-DFFF and the NFFFE-NFFFF ranges regardless of where it appears and accept everything else?

Gabriel439 · October 21, 2019, 6:07pm

@SiriusStarr: Yeah, that is correct. The set of valid code points should apply to both types of escape sequences and also to characters entered unescaped.
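
One way to express that rule is to decide by numeric range rather than by general category. The following is a minimal sketch under that assumption, not the actual fix applied in dhall-haskell. It reuses the name `validCodepoint` from the check quoted above purely for illustration. Note that the low-16-bits test also rejects U+10FFFE and U+10FFFF, which are Unicode non-characters but lie outside the `N` in `{ 0 .. F }` wording of the quoted comment.

```haskell
import Data.Bits ((.&.))
import Data.Char (ord)

-- | Sketch: a code point is valid unless it is a surrogate or a
-- non-character of the form xNFFFE / xNFFFF.
validCodepoint :: Char -> Bool
validCodepoint c = not (isSurrogate || isNonCharacter)
  where
    n = ord c

    -- Surrogates occupy the single range D800-DFFF.
    isSurrogate = n >= 0xD800 && n <= 0xDFFF

    -- The last two code points of every plane end in FFFE or FFFF,
    -- so masking the low 16 bits (minus the last bit) detects both.
    isNonCharacter = n .&. 0xFFFE == 0xFFFE
```

With this check `validCodepoint '\x1FFF0'` holds, while `'\xD800'`, `'\xFFFE'`, and `'\x1FFFF'` are rejected. Applying the same predicate to unbraced escapes, braced escapes, and unescaped characters would give the uniform behavior described above.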

sjakobi · November 1, 2019, 9:19pm

@SiriusStarr would you mind adding parser test cases that would reveal these bugs?

SiriusStarr · November 2, 2019, 12:01am

If you want to tell me how to generate the .dhallb files for things that currently fail to parse, sure, 'cause I am clueless. Or do you just mean for the things that don’t fail that should?

sjakobi · November 2, 2019, 5:45am

I think you could start with some accepted Dhall string, encode it with dhall encode --json, then manipulate the JSON to contain the interesting characters and convert it to CBOR/.dhallb with json2cbor.rb.