Dhall-haskell not conforming to the standard with respect to disallowed unicode characters?

This is merely for my own edification, but hopefully someone can shed some light on it.

The ABNF in the language standard makes it pretty clear that certain unicode escape sequences are not valid:

; The parser must also reject Unicode escape sequences that are either:
;
; * Surrogate pairs (i.e. `%xD800-DFFF`)
; * Non-characters (i.e. `%xNFFFE-xNFFFF` for each `N` in `{ 0 .. F }`)

However, dhall seems perfectly content to accept text that violates these rules, e.g.

$ dhall <<< '"\uD800"'
"�"
$ dhall <<< '"\uFFFE"'
""

(Note that there is a character not rendering in the second example above.)

Additionally, it seems to reject sequences that are supposedly accepted per the ABNF, e.g.

$ dhall <<< '"\u{1FFF0}"'
dhall: 
Error: Invalid input

(stdin):1:10:
  |
1 | "\u{1FFF0}"
  |          ^
Invalid Unicode code point

Additionally, surrounding the aforementioned cases with curly braces causes them to be rejected…

$ dhall <<< '"\u{D800}"'
dhall: 
Error: Invalid input

(stdin):1:9:
  |
1 | "\u{D800}"
  |         ^
Invalid Unicode code point

$ dhall <<< '"\u{FFFE}"'
dhall: 
Error: Invalid input

(stdin):1:9:
  |
1 | "\u{FFFE}"
  |         ^
Invalid Unicode code point

Just trying to understand what is going on here.

@SiriusStarr: So the Haskell implementation accepting \uD800/\uFFFE is due to a bug in the standard. The standard only forbids invalid code points for braced escape sequences:

… but it should also be forbidding them for non-braced escape sequences.

The case of the Haskell implementation rejecting \u{1FFF0} is a bug in the Haskell implementation, which uses the following check

Data.Char does not contain a Unicode general category that matches only non-characters so I must have used Char.NotAssigned as an approximation, leading to that issue.
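To illustrate why a general-category check over-rejects, here is a rough sketch in Python rather than Haskell. It is not the check dhall-haskell uses. Python's unicodedata module reports the same Unicode general categories, where "Cn" corresponds to Data.Char's NotAssigned. The helper names are hypothetical.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Return the Unicode general category of a code point, e.g. 'Lu' or 'Cn'."""
    return unicodedata.category(chr(cp))


def is_noncharacter(cp: int) -> bool:
    """True for the xFFFE/xFFFF pair that ends each plane (U+FFFE, U+1FFFE, ...)."""
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFF) >= 0xFFFE


# Non-characters carry category 'Cn', but so does every code point that simply
# has no assignment yet. The category alone cannot tell the two apart.
for cp in (0xFFFE, 0x1FFFE, 0x1FFF0):
    print(f"U+{cp:04X}", general_category(cp), is_noncharacter(cp))
```

Any code point that is merely unassigned in the Unicode version the runtime ships with also comes back as "Cn", so a check based on that category rejects more than the two ranges named in the ABNF comment.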

Ahh, I see. So the “correct” behavior for an implementation is to always reject anything in D800-DFFF and the NFFFE-NFFFF ranges regardless of where it appears and accept everything else?

@SiriusStarr: Yeah, that is correct. The set of valid code points should apply to both types of escape sequences and also to characters entered unescaped
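As a minimal sketch of that rule (Python, not taken from any Dhall implementation; the function name is hypothetical), a single predicate can be shared by braced escapes, non-braced escapes, and unescaped characters. It assumes the non-character rule also covers plane 0x10 (U+10FFFE and U+10FFFF), which goes slightly beyond the `N` in `{ 0 .. F }` wording quoted above.

```python
def is_valid_code_point(cp: int) -> bool:
    """Accept every Unicode code point except surrogates and non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        return False  # outside the Unicode code space
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogate range
    if (cp & 0xFFFF) >= 0xFFFE:
        return False  # xFFFE / xFFFF at the end of each plane (assumed: all 17 planes)
    return True


# The cases from this thread:
for cp in (0xD800, 0xFFFE, 0x1FFF0):
    print(f"U+{cp:04X}", "accept" if is_valid_code_point(cp) else "reject")
```

Under this predicate \uD800 and \uFFFE are rejected in either escape form, while \u{1FFF0} is accepted.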

@SiriusStarr would you mind adding parser test cases that would reveal these bugs?

If you want to tell me how to generate the .dhallb files for things that currently fail to parse, sure, 'cause I am clueless. Or do you just mean for the things that don’t fail that should?

I think you could start with some accepted Dhall string, encode it with dhall encode --json, then manipulate the JSON to contain the interesting characters and convert it to CBOR/.dhallb with json2cbor.rb.
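The JSON manipulation step could be scripted roughly as follows. This is only a sketch: it assumes that dhall encode --json renders a plain text literal as a two-element array of the form [18, "text"], which should be checked against real output before relying on it, and the function name is hypothetical.

```python
import json


def replace_text_literal(encoded_json: str, new_text: str) -> str:
    """Swap the string inside an encoded plain text literal.

    Assumes the input has the shape [18, "some text"] (an assumption about
    what `dhall encode --json` prints for a literal with no interpolation).
    """
    expr = json.loads(encoded_json)
    if not (isinstance(expr, list) and len(expr) == 2 and expr[0] == 18):
        raise ValueError("expected a plain text literal of the form [18, \"...\"]")
    expr[1] = new_text
    # json.dumps escapes non-ASCII characters by default, which keeps the
    # result safe to paste through a shell.
    return json.dumps(expr)


# Put U+1FFF0 (which the parser should accept) into the literal.
print(replace_text_literal('[18, "placeholder"]', "\U0001FFF0"))
```

The resulting JSON can then be passed to json2cbor.rb as described above to produce the .dhallb file.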