Text manipulation functions

Gabriel439 · March 6, 2020, 4:47pm

I forgot to mention that besides being error-prone it would deteriorate discoverability due to being weakly typed, as a user would not be able to infer what to supply for the Text value since the type does not suggest that it expects an internal structure of comma-separated values.

blamario · March 6, 2020, 6:42pm

That is almost a philosophical choice. As long as you’re aware of its consequences, it’s hard to argue against. My own design philosophy is to give the tools to the developer, even if they can be used to construct a gun and shoot their foot. Mind you, if there is a way to make that bad outcome less likely, and good outcomes more, of course I’ll take it. The choice is rarely that clear.

Now about those consequences. Your comma-separated list is an easy example to disallow, but what are you going to do about the established structured strings that are not a developer’s whim? The prime examples are file paths, dates and times. If a user wishes to get a parent directory for a given path, or the year of a given date, you have four options:

provide text-splitting primitives,
add FilePath and Date types to the language with the appropriate operations,
tell the user to provide the directory and year as separate inputs, or
send them away.

You seem to be arguing for option #3, but that’s going to feel like #4 for many users, if not most of them. Option #2 is technically the safest and most correct one, but – please correct me if I’m wrong – it’s way too complex for Dhall. So really the choice is #1 or #4, adding the text-splitting primitives or refusing to support a large subset of potential users.

blamario · March 6, 2020, 7:22pm

I had an idea. Allow me to add another option to my list:

1.5. Add the ability to declare structured string types, such as Date or FilePath, at the I/O boundary only

Here’s an example:

let Date = {year : Natural,
            month : Natural,
            day : Natural}
let DateYMD : Type = Date as Text separated with "-"
let today : DateYMD = "2020-03-06"
in today.year

This way all text introspection happens at input time, and there’s no way it can be abused within the program. It’s in keeping with another good blog post.

Gabriel439 · March 6, 2020, 7:49pm

@blamario: Yeah, I like that idea. I had a similar idea here (in the context of importing JSON): https://github.com/dhall-lang/dhall-lang/issues/121#issuecomment-511955678

ari-becker · March 7, 2020, 5:08pm

@blamario reminds me more of the user-defined grammars issue, which is maybe the issue which @Gabriel439 was thinking about in the #121 comment.

blamario · March 7, 2020, 7:07pm

My idea was really only about minimal support for records and lists represented as separated strings. What @Gabriel439 was hinting at seems more like full text-parsing support that’s constrained to I/O. I like his idea even better in principle, but I’d like to see more detail.

Starting from the ./someImport.lang as ./someGrammar.dhall syntax, I’d like to see clarified:

What is the language available inside someGrammar.dhall? How does it specify a grammar?
- Does it have some text-parsing primitives available, like stripPrefix etc. I outlined above? If so, how are they made available there but not in regular Dhall? Is that reflected in the type of someGrammar?
- Following on the last thought, there could be a built-in Grammar type that’s basically an applicative functor or even a monad. It would come with a number of primitive constructors and combinators that can appear anywhere. The only way to apply a Grammar, however, would be the as keyword. In this design there would be no stripPrefix : Text -> Text -> Maybe Text, only matchPrefix : Text -> Grammar ().
Note that my weaker idea of record as Text separated by can be easily extended to its inverse text as Record separated by. Would a grammar specification also be bi-directional? In other words, would there be a way to serialize a Dhall into a string according to a grammar, such as syntax value as text of ./someGrammar.dhall? If value is constant, what would be the normal form of this?
Would a string literal be allowed on the left-hand side on as? For example, would "2020-03-07" as Date be legal? How about arbitrary Text expressions?
The right-hand side of as, ignoring the design-imposed constraints for a moment, is really nothing more than a function of type Text -> a. Could these functions be composed? For example, ./myFile.json.gz as MyNormalizer . JSON . UTF8 . GZIP?

Gabriel439 · March 7, 2020, 11:56pm

@blamario:

I haven’t really thought this through, but the rough idea I had in mind was that the ./grammar.dhall expression would be an ordinary Dhall expression of type Text → Optional A or Text → < Error : Text | Result : A > with access to additional Text introspection built-ins.
The grammar does not need to be bidirectional. Nobody has requested this that I know of
You could permit arbitrary expressions instead of restricting this to just imports, but this wouldn’t change anything. The reason why is that imports are type-checked with an empty context, so they can’t refer to values in scope. So, for example, an expression like λ(x : Text) → x as Date would be a type error because the subexpression x would be type-checked with an empty context where the bound variable x was no longer in scope. That prevents the as ./grammar.dhall mechanism from being used as a Text introspection backdoor.
Presumably the right-hand-side could be an arbitrary Dhall expression, so grammars are composable insofar as Dhall expressions are composable

blamario · March 9, 2020, 12:52am

That would probably be the shortest way to get something in working order, but how do you distinguish between Dhall expressions that have access to the Text introspection built-ins and those that don’t? I mean, the ./grammar.dhall file by itself is not on the right-hand side of any as. Would it be considered legal by itself? What would be the output of dhall <<< ./grammar.dhall?

Perhaps not yet, but for any json-to-dhall there is a dhall-to-json. Any format important enough to be imported in its native form will probably be important enough to be exported as well. However that can be accomplished with a separate pretty-printer, and your answer to #1 precludes any more unified solution.

Gabriel439 · March 9, 2020, 1:58pm

@blamario: One way to distinguish a file that depends on additional builtins would be this idea:

joneshf · March 12, 2020, 2:49am

What about choosing an actual grammar as the grammar type? Maybe start with something like WSN or BNF? Is that too meta?

blamario · March 12, 2020, 4:52pm

For one thing, BNF and WSN by themselves wouldn’t specify the mapping between the text input and the Dhall value. We could extend them with appropriate constructs, but that would be a new grammar formalism. You’d probably want to design it from scratch to make it as close to Dhall as possible.

Profpatsch · March 16, 2020, 10:55am

Instead of text equality, I would really like something like an open sum type instead.

Because I think most people arguing for Text/equal (also lol what does that even mean) really want open sum types with an equality on the constructors, like symbols in lisp.

Then there’s the Text/split and Text/lowercase camp, but

Text/split is going to make people implement string parsing algorithms again, which leads to “oh no, this language totally not made for this is slow, I need more primitives to make it faster”.
I will refer to https://github.com/mozilla/nixpkgs-mozilla/blob/master/lib/parseTOML.nix as an example (Fun fact: to enable that, a tokenizer builtin was added to nix, but the parser was still horribly slow obviously, so in the end they fixed it by adding the builtins.fromTOML builtin).

Text/lowercase has the same problems as Text/equal: lol, what does that even mean

blamario · March 17, 2020, 9:02pm

Glad you asked: https://www.unicode.org/reports/tr15/
For Text/compare there’s https://www.unicode.org/reports/tr10/

That would be http://www.unicode.org/L2/L1999/99190.htm
Mind you, what people usually want to do with Text/lowercase and like is case-insensitive comparison, and that’s better done directly.

Profpatsch · March 26, 2020, 9:44pm

I do hope you are joking.

Gabriel439 · March 27, 2020, 8:23pm

@Profpatsch: Please try to avoid phrasing things in a confrontational way. If you disagree, address the substance of what they are saying instead of appealing to ridicule.

Also, I do believe there is a sensible way to implement Text/equal. The document that @blamario linked to might appear imposing, but it essentially boils down to: two Text values are equal if their normal forms are equal, and normal forms are well-understood for Unicode (any mainstream language with a Unicode library will support this)

The reason we haven’t yet added Text/equal isn’t because of concerns over Unicode compliance, but rather due to trying to design the language to work without it as long as possible.

Profpatsch · April 2, 2020, 4:21am

For reference, here’s the Eq function for the built-in, guaranteed-to-be-utf8 str in rust

    /// Implements ordering of strings.
    ///
    /// Strings are ordered  lexicographically by their byte values. This orders Unicode code
    /// points based on their positions in the code charts. This is not necessarily the same as
    /// "alphabetical" order, which varies by language and locale. Sorting strings according to
    /// culturally-accepted standards requires locale-specific data that is outside the scope of
    /// the `str` type.
    #[stable(feature = "rust1", since = "1.0.0")]
    impl Ord for str {
        #[inline]
        fn cmp(&self, other: &str) -> Ordering {
            self.as_bytes().cmp(other.as_bytes())
        }
    }

    #[stable(feature = "rust1", since = "1.0.0")]
    impl PartialEq for str {
        #[inline]
        fn eq(&self, other: &str) -> bool {
            self.as_bytes() == other.as_bytes()
        }
        #[inline]
        fn ne(&self, other: &str) -> bool {
            !(*self).eq(other)
        }
    }

I don’t see the collation algorithms coming with the rust standard library.

Haskell’s Text type also does bytewise comparison as far as I can see.

From the collation standard:

Also, defaults in CLDR data may vary by locale. For example, normalization is turned off in most CLDR locales (those that don’t normally use multiple accents). The default for strength in UCA is tertiary ; it can be changed for different locales in CLDR.

Another point: the static dhall binary is about 3MB right now. How big are the collation data tables? I assume they will have to be embedded?

Profpatsch · April 2, 2020, 4:39am

Here’s a discussion about unicode normalization in the rust stdilb: https://internals.rust-lang.org/t/rust-unicode/2730/18

There was also a very good, pretty exhaustive blog post somewhere, but I can’t find it right now.

Profpatsch · April 2, 2020, 4:44am

Here’s a good comment from that thread:

As the original author of the NF(K){C,D} implementations that were originally in libunicode and are now in the unicode-normalization crate I feel obliged to chime in here.
I think performing normalization as part of PartialEq::eq to “fix” string equality would be misguided. Different applications want different normalization, and (as others have said) thinking normalization on its own is sufficient is naive in many cases.

blamario · April 4, 2020, 2:53pm

My answer was indeed somewhat tongue-in-cheek. Sorry I missed the ensuing discussion, which I am quite glad to see.

I actually agree with the Rust conclusion that Dhall’s default string equality should not include full Unicode normalization. The discussion you linked already brings up the complexity argument and the fact that many applications need exact (byte-wise or code-point-wise) equality. To me the crucial point is that Unicode defines the (already linked) canonical normalizations that together with the primitive exact equality give you a canonical equality. I’m not sure if the relationship between code-point-wise and lexicographical ordering is as neat, but we can cross that bridge when we get there.

My larger point is that people have already thought through these questions, and they have provided clear answers (down to implementation algorithms) where possible. Every programming language designed in this century should follow the Unicode recommendations. Yes they’re complex (because writing is complex) but that’s no reason to give up.

Gabriel439 · May 1, 2020, 9:53pm

I created an issue to track the idea of parsing Text at import resolution time similar to what @blamario proposed: https://github.com/dhall-lang/dhall-lang/issues/989

I know it’s not exactly germane to the original request of truncating Text, which we may still address in a separate way (perhaps a new Text/truncate built-in), but I just wanted to mention it briefly here since the topic had come up yet again recently.