Text manipulation functions

@sjakobi: Yeah, that’s a good point

So assuming that we agree with @blamario that we should focus on high-level primitives (i.e. ones that are not based on individual characters), there are basically three possible options to choose from:

  • Make Text “transparent”

    e.g. add a Text/split : Text → Text → List Text primitive and then we could implement the original request as:

    Prelude.Text.concatSep "_" (Text/split "-" name)
    
  • Add Text transformations that don’t enable introspection

    i.e. instead of Text/split we add built-ins like Text/replace : Text → Text → Text → Text, which does not permit introspection

    Some other hypothetical primitives that might fall into this category are:

    • Text/take : Natural → Text → Text
    • Text/escapeXML : Text → Text
  • Continue to keep Text opaque
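
To make the difference between the first two options concrete, here is a rough Haskell analogue using Data.Text from the text package. This is not Dhall: at this point in the thread Text/split and Text/replace are only proposals, and the names viaSplit and viaReplace are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- Option 1 ("transparent"): split into a list, then rejoin.
-- The intermediate [Text] is what makes introspection possible.
viaSplit :: Text -> Text
viaSplit name = T.intercalate "_" (T.splitOn "-" name)

-- Option 2 (non-introspective): Text goes in, Text comes out,
-- and the caller never gets to inspect the pieces.
viaReplace :: Text -> Text
viaReplace name = T.replace "-" "_" name
```

Both functions turn "my-package-name" into "my_package_name"; the difference lies only in what else the underlying primitive lets a user do.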

The best choice depends on the ultimate use cases. That being said, the first option does not really make Text transparent or enable introspection in the same way that Haskell’s accursed type String = [Char] does. I can’t think of any serious harm it could cause in the long term. It could perhaps be made obsolete by more primitive functions in the future, but that doesn’t seem like a heavy burden. If nobody else can think of a more serious problem, that would be my default choice.

Again it depends on the desired end goal, but one possible design is to gradually add primitives that both

  • keep Text opaque but decompose it into smaller Text values, and
  • can be used to eventually build a parser combinator library.

The first property ensures that no door to an alternative approach is closed, the second that every possible use case is eventually covered. The only primitives you really need to build a decently-performing, general-purpose combinator library are

  • stripPrefix :: Text -> Text -> Maybe Text
  • splitAt :: Natural -> Text -> (Text, Text)
  • span :: (Text -> Bool) -> Text -> (Text, Text)
  • A family of Text -> Bool functions, such as startsWithLetter, startsWithNumber, etc.
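
As a minimal sketch of how far those primitives go, here is a tiny applicative parser in Haskell built only on stripPrefix, splitAt and span from Data.Text. Several things here are assumptions rather than anything proposed above: the Parser type, the names literal, takeN, spanText, digits and version, and the use of Int where the list above says Natural (Data.Text's splitAt takes an Int).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isDigit)
import Data.Text (Text)
import qualified Data.Text as T

-- A parser consumes a prefix of the input and returns the remainder.
newtype Parser a = Parser { runParser :: Text -> Maybe (a, Text) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, rest) -> (f a, rest)) (p s)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, rest)  <- pf s
    (a, rest') <- pa rest
    Just (f a, rest')

-- Built on stripPrefix: match a literal piece of text.
literal :: Text -> Parser ()
literal t = Parser $ \s -> fmap (\rest -> ((), rest)) (T.stripPrefix t s)

-- Built on splitAt: take exactly n characters, failing on short input.
takeN :: Int -> Parser Text
takeN n = Parser $ \s ->
  let (front, rest) = T.splitAt n s
  in if T.length front == n then Just (front, rest) else Nothing

-- span with a Text -> Bool predicate, emulated on top of T.span.
-- This is only faithful for predicates that inspect the first character.
spanText :: (Text -> Bool) -> Text -> (Text, Text)
spanText p = T.span (p . T.singleton)

-- One member of the "startsWith..." family.
startsWithDigit :: Text -> Bool
startsWithDigit t = case T.uncons t of
  Just (c, _) -> isDigit c
  Nothing     -> False

-- Built on span: one or more digits.
digits :: Parser Text
digits = Parser $ \s ->
  let (front, rest) = spanText startsWithDigit s
  in if T.null front then Nothing else Just (front, rest)

-- Example grammar: "v<major>.<minor>".
version :: Parser (Text, Text)
version = (,) <$> (literal "v" *> digits) <*> (literal "." *> digits)
```

With these definitions, runParser version "v1.42-rc" yields Just (("1","42"),"-rc"), and the same parser returns Nothing on "1.42". Note how much of the readability comes from the applicative operators, which is exactly what would be missing in Dhall.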

The main downsides stem from Dhall not being Haskell. Without user-defined operators and do-notation the parsers would not look as nice. I’m not sure if the type system is up to the task either. And finally, while the result would be completely general-purpose and pretty fast, for any specific use case it would still be slower than a dedicated function like split.

I think an important question (which I think you were hinting at @Gabriel439) is whether we want to support text equality or even substring checking (like Text/hasSubString : Text -> Text -> Bool).

I think that once we have one of these features (enabled for example via Text/split : Text -> Text -> List Text), users may use that instead of properly modelling their domain with unions, thereby reducing clarity and ultimately maintainability of their configurations.

So I think we ought to be cautious about allowing this kind of Text introspection. From this perspective it would be safer to go with “non-introspective” operations like Text/replace or Text/take.

(These considerations are obviously inspired by one of your blogposts: http://www.haskellforall.com/2016/04/worst-practices-should-be-hard.html)

@sjakobi: Yeah, that’s where I was going with that distinction about introspection. As a simple example, if we provided a Text/split built-in then users might choose to model lists as comma-separated unquoted values. In other words instead of this:

[ "foo", "bar", "baz" ]

… they might try to create a Text DSL where they ask the user to instead supply:

"foo, bar, baz"

… and that DSL would be vulnerable to mistakes like elements containing commas in their names.

I forgot to mention that besides being error-prone, it would hurt discoverability because it is weakly typed: a user could not infer what to supply for the Text value, since the type does not suggest that it expects an internal structure of comma-separated values.
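
The comma problem is easy to reproduce. The following Haskell sketch stands in for a hypothetical Text/split-based list DSL; parseList, roundTrip and the sample data are invented for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- What a comma-separated Text DSL would have to do internally.
parseList :: Text -> [Text]
parseList = map T.strip . T.splitOn ","

-- A list whose second element legitimately contains a comma.
intended :: [Text]
intended = ["foo", "bar, inc.", "baz"]

-- Encode a list as comma-separated Text, then decode it again.
roundTrip :: [Text] -> [Text]
roundTrip = parseList . T.intercalate ", "
```

roundTrip ["foo", "bar", "baz"] gives back the same list, but roundTrip intended comes back with four elements instead of three. A real List Text has no such failure mode, and its type already tells the user what to supply.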

That is almost a philosophical choice. As long as you’re aware of its consequences, it’s hard to argue against. My own design philosophy is to give the tools to the developer, even if they can be used to construct a gun and shoot their foot. Mind you, if there is a way to make that bad outcome less likely, and good outcomes more, of course I’ll take it. The choice is rarely that clear.

Now about those consequences. Your comma-separated list is an easy example to disallow, but what are you going to do about the established structured strings that are not a developer’s whim? The prime examples are file paths, dates and times. If a user wishes to get a parent directory for a given path, or the year of a given date, you have four options:

  1. provide text-splitting primitives,
  2. add FilePath and Date types to the language with the appropriate operations,
  3. tell the user to provide the directory and year as separate inputs, or
  4. send them away.

You seem to be arguing for option #3, but that’s going to feel like #4 for many users, if not most of them. Option #2 is technically the safest and most correct one, but – please correct me if I’m wrong – it’s way too complex for Dhall. So really the choice is #1 or #4, adding the text-splitting primitives or refusing to support a large subset of potential users.

I had an idea. Allow me to add another option to my list:

1.5. Add the ability to declare structured string types, such as Date or FilePath, at the I/O boundary only

Here’s an example:

let Date = {year : Natural,
            month : Natural,
            day : Natural}
let DateYMD : Type = Date as Text separated with "-"
let today : DateYMD = "2020-03-06"
in today.year

This way all text introspection happens at input time, and there’s no way it can be abused within the program. It’s in keeping with another good blog post.
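
For comparison, this is roughly the work such a declaration would have to do at the input boundary, sketched in Haskell. The function names dateYMD and natural are hypothetical, and the sketch assumes the same shape as the record above: three Natural fields with no range checks, so a month of 13 would still be accepted.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Char (isDigit)
import Data.Text (Text)
import qualified Data.Text as T
import Numeric.Natural (Natural)

data Date = Date { year :: Natural, month :: Natural, day :: Natural }
  deriving (Eq, Show)

-- Read a non-empty run of ASCII digits as a Natural.
natural :: Text -> Maybe Natural
natural t
  | not (T.null t) && T.all isDigit t = Just (read (T.unpack t))
  | otherwise                         = Nothing

-- Hypothetical analogue of: Date as Text separated with "-"
-- The text is taken apart exactly once, here, and nowhere else.
dateYMD :: Text -> Maybe Date
dateYMD t = case T.splitOn "-" t of
  [y, m, d] -> Date <$> natural y <*> natural m <*> natural d
  _         -> Nothing
```

Here fmap year (dateYMD "2020-03-06") is Just 2020, while malformed input such as "2020-03" is rejected up front instead of leaking into the rest of the program.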

@blamario: Yeah, I like that idea. I had a similar idea here (in the context of importing JSON): https://github.com/dhall-lang/dhall-lang/issues/121#issuecomment-511955678

@blamario: Your proposal reminds me more of the user-defined grammars issue, which may be the issue @Gabriel439 was thinking about in the #121 comment.

My idea was really only about minimal support for records and lists represented as separated strings. What @Gabriel439 was hinting at seems more like full text-parsing support that’s constrained to I/O. I like his idea even better in principle, but I’d like to see more detail.

Starting from the ./someImport.lang as ./someGrammar.dhall syntax, I’d like to see clarified:

  1. What is the language available inside someGrammar.dhall? How does it specify a grammar?

    • Does it have some text-parsing primitives available, like the stripPrefix etc. that I outlined above? If so, how are they made available there but not in regular Dhall? Is that reflected in the type of someGrammar?
    • Following on the last thought, there could be a built-in Grammar type that’s basically an applicative functor or even a monad. It would come with a number of primitive constructors and combinators that can appear anywhere. The only way to apply a Grammar, however, would be the as keyword. In this design there would be no stripPrefix : Text -> Text -> Maybe Text, only matchPrefix : Text -> Grammar ().
  2. Note that my weaker idea of record as Text separated by can be easily extended to its inverse, text as Record separated by. Would a grammar specification also be bidirectional? In other words, would there be a way to serialize a Dhall value into a string according to a grammar, with syntax such as value as text of ./someGrammar.dhall? If value is constant, what would be the normal form of this?

  3. Would a string literal be allowed on the left-hand side of as? For example, would "2020-03-07" as Date be legal? How about arbitrary Text expressions?

  4. The right-hand side of as, ignoring the design-imposed constraints for a moment, is really nothing more than a function of type Text -> a. Could these functions be composed? For example, ./myFile.json.gz as MyNormalizer . JSON . UTF8 . GZIP?
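
The built-in Grammar idea from the first question can be sketched in Haskell as an abstract applicative type with a single eliminator. Only matchPrefix comes from the discussion above; untilSep, remaining, keyValue and parseAs (standing in for the as keyword) are hypothetical names, and a real design would hide the Grammar constructor so that parseAs is the only way to run one.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- Abstract in spirit: users would only see the primitives below.
newtype Grammar a = Grammar (Text -> Maybe (a, Text))

instance Functor Grammar where
  fmap f (Grammar g) = Grammar $ \s -> fmap (\(a, rest) -> (f a, rest)) (g s)

instance Applicative Grammar where
  pure a = Grammar $ \s -> Just (a, s)
  Grammar gf <*> Grammar ga = Grammar $ \s -> do
    (f, rest)  <- gf s
    (a, rest') <- ga rest
    Just (f a, rest')

-- Instead of stripPrefix : Text -> Text -> Maybe Text.
matchPrefix :: Text -> Grammar ()
matchPrefix t = Grammar $ \s -> fmap (\rest -> ((), rest)) (T.stripPrefix t s)

-- Hypothetical primitive: everything before a (non-empty) separator.
untilSep :: Text -> Grammar Text
untilSep sep = Grammar $ \s -> Just (T.breakOn sep s)

-- Hypothetical primitive: whatever input is left.
remaining :: Grammar Text
remaining = Grammar $ \s -> Just (s, "")

-- Stand-in for the `as` keyword: the only way to apply a Grammar.
-- It succeeds only if the whole input is consumed.
parseAs :: Grammar a -> Text -> Maybe a
parseAs (Grammar g) s = case g s of
  Just (a, leftover) | T.null leftover -> Just a
  _                                    -> Nothing

-- Example grammar: "<key>=<value>".
keyValue :: Grammar (Text, Text)
keyValue = (,) <$> untilSep "=" <* matchPrefix "=" <*> remaining
```

parseAs keyValue "name=value" gives Just ("name","value"), and input without an "=" is rejected. Since no function of type Text -> Maybe Text is exposed, ordinary code has no way to take a Text value apart.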

@blamario:

  1. I haven’t really thought this through, but the rough idea I had in mind was that the ./grammar.dhall expression would be an ordinary Dhall expression of type Text → Optional A or Text → < Error : Text | Result : A > with access to additional Text introspection built-ins.

  2. The grammar does not need to be bidirectional. Nobody has requested this, as far as I know.

  3. You could permit arbitrary expressions instead of restricting this to just imports, but this wouldn’t change anything. The reason why is that imports are type-checked with an empty context, so they can’t refer to values in scope. So, for example, an expression like λ(x : Text) → x as Date would be a type error because the subexpression x would be type-checked with an empty context where the bound variable x was no longer in scope. That prevents the as ./grammar.dhall mechanism from being used as a Text introspection backdoor.

  4. Presumably the right-hand-side could be an arbitrary Dhall expression, so grammars are composable insofar as Dhall expressions are composable

That would probably be the shortest way to get something in working order, but how do you distinguish between Dhall expressions that have access to the Text introspection built-ins and those that don’t? I mean, the ./grammar.dhall file by itself is not on the right-hand side of any as. Would it be considered legal by itself? What would be the output of dhall <<< ./grammar.dhall?

Perhaps not yet, but for any json-to-dhall there is a dhall-to-json. Any format important enough to be imported in its native form will probably be important enough to be exported as well. However that can be accomplished with a separate pretty-printer, and your answer to #1 precludes any more unified solution.

@blamario: One way to distinguish a file that depends on additional builtins would be this idea:

What about choosing an actual grammar as the grammar type? Maybe start with something like WSN or BNF? Is that too meta?

For one thing, BNF and WSN by themselves wouldn’t specify the mapping between the text input and the Dhall value. We could extend them with appropriate constructs, but that would be a new grammar formalism. You’d probably want to design it from scratch to make it as close to Dhall as possible.

Instead of text equality, I would really like something like an open sum type instead.

Because I think most people arguing for Text/equal (also lol what does that even mean) really want open sum types with an equality on the constructors, like symbols in lisp.

Then there’s the Text/split and Text/lowercase camp, but

  • Text/split is going to make people implement string parsing algorithms again, which leads to “oh no, this language totally not made for this is slow, I need more primitives to make it faster”.
    I will refer to https://github.com/mozilla/nixpkgs-mozilla/blob/master/lib/parseTOML.nix as an example (Fun fact: to enable that, a tokenizer builtin was added to nix, but the parser was still horribly slow obviously, so in the end they fixed it by adding the builtins.fromTOML builtin).
  • Text/lowercase has the same problems as Text/equal: lol, what does that even mean

Glad you asked: https://www.unicode.org/reports/tr15/
For Text/compare there’s https://www.unicode.org/reports/tr10/

That would be http://www.unicode.org/L2/L1999/99190.htm
Mind you, what people usually want to do with Text/lowercase and the like is case-insensitive comparison, and that’s better done directly.
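
A small Haskell illustration of why the direct route is preferable, using toLower and toCaseFold from Data.Text. The two helper names are made up, and "\223" is the escape for the German eszett (U+00DF).

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- Indirect approach: lowercase both sides, then compare.
equalViaLower :: Text -> Text -> Bool
equalViaLower a b = T.toLower a == T.toLower b

-- Direct approach: compare under Unicode case folding.
equalIgnoringCase :: Text -> Text -> Bool
equalIgnoringCase a b = T.toCaseFold a == T.toCaseFold b
```

For plain ASCII the two agree, but for "Stra\223e" versus "STRASSE" only the case-folding version reports a match, because lowercasing leaves the eszett alone while case folding expands it to "ss". Neither helper performs Unicode normalization, which is a separate concern.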

I do hope you are joking.