Text manipulation functions

Where can I find text manipulation functions besides Prelude.Text?

In particular I’m looking for a way to change "foo-bar" into "foo_bar".

Text is currently opaque; you can't really manipulate it except with ++ and Text/show.

If you want to join "foo" and "bar" in different ways, maybe keep them as a pair and concatenate as needed. Or store them as unions with different renderers.
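For illustration, a minimal sketch of the union idea, with made-up constructor names (one alternative per image, one renderer per output format):

let Image = < FooBar | FooBarBaz >

let toKebab =
      λ(i : Image) → merge { FooBar = "foo-bar", FooBarBaz = "foo-bar-baz" } i

let toSnake =
      λ(i : Image) → merge { FooBar = "foo_bar", FooBarBaz = "foo_bar_baz" } i

in  { name = toKebab Image.FooBar, variablePrefix = toSnake Image.FooBar }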


Unfortunately that's not what I want to do. For a Compose file I want to create the following entry (YAML):

services:
  the-service:
    image: "org/image:${image_TAG:-latest}"

so it's possible to run image_TAG=my-local-build docker-compose up to override the default tag for the service in question. Unfortunately we have some images with names like foo-bar, and docker-compose doesn't allow the - in the variable name. We even have some images with names like foo-bar-baz, so I really want to replace - with _ in the string itself.

Perhaps the best option for converting kebab case to snake case would be to ask for the function upstream?

Well, Dhall doesn’t support that right now.

Where do the image names come from though? Could they be defined as List Text instead of Text, so you can Prelude.Text.concatSep "_" them for docker-compose?
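For example (a minimal sketch, assuming the parts of the name are kept as a List Text):

let Prelude = https://prelude.dhall-lang.org/package.dhall

let imageName = [ "foo", "bar", "baz" ]

in  { image = Prelude.Text.concatSep "-" imageName
    , variablePrefix = Prelude.Text.concatSep "_" imageName
    }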

Maybe we should consider adding a Text/split : Text -> Text -> List Text function, kind of the inverse of Prelude.Text.concatSep. So you’d have

Prelude.Text.concatSep sep (Text/split sep text) == text

Specifying it might be tricky though. I’m not sure…
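To illustrate what I have in mind (the built-in is hypothetical, so its behaviour is only sketched in comments, but the concatSep side of the property can be checked today):

-- Intended behaviour, by example:
--   Text/split "-" "foo-bar-baz" ≡ [ "foo", "bar", "baz" ]
--   Text/split "-" "foo"         ≡ [ "foo" ]
-- An empty separator is one of the tricky cases:
--   Text/split "" "foo" ≡ ?

let Prelude = https://prelude.dhall-lang.org/package.dhall

in  assert : Prelude.Text.concatSep "-" [ "foo", "bar", "baz" ] ≡ "foo-bar-baz"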


Actually it would probably be better to return a NonEmptyList, so you statically know that it has at least one element. The property would then be something like

Prelude.Text.concatSep sep (Prelude.List.NonEmpty.toList Text (Text/split sep text)) == text

Part of the reason I’ve dragged my feet on supporting Text manipulation built-ins is that it requires proper handling of Unicode, which is challenging to standardize even for operations that appear benign.

For example, consider a simpler utility like Text/truncate : Natural → Text → Text, which has been requested a few times. Correctly truncating a Text literal in a Unicode-aware way is non-trivial and these slides do a good job of illustrating the complexity involved:

https://hoytech.github.io/truncate-presentation/


Well, I take it that the only option at the moment is to write a small post-processor for the generated YAML that picks out the offending pieces and transforms them. Not the nicest of solutions, but OK.


Many thanks for the link to those slides!

I wonder what we should expect from any Dhall text-manipulation builtins though. Ideally we'd allow manipulations at the grapheme cluster level of course, but that's more than e.g. the Haskell standard libraries offer AFAIK.

Instead, I think it would probably cover 99% of use cases if we’d avoid breaking multi-byte code-points.

In that case, it actually seems fairly easy to specify Text/split : Text -> Text -> List Text for UTF-8-encoded strings: you can simply iterate byte by byte through the string to be searched and check whether the bytes of the "needle" match the bytes at the current byte offset. Since UTF-8 is self-synchronizing, a byte-level match of a valid needle always falls on code-point boundaries, so I believe you wouldn't even need to understand how different code-points are encoded with different numbers of bytes.

Text/split could then be used as a building block for other functions for Text, for example Text/replace. Text/equal would probably also require a Text/null builtin.
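For example, a replace function could be specified directly in terms of Text/split (a sketch; Text/split here is the hypothetical built-in discussed above):

let Prelude = https://prelude.dhall-lang.org/package.dhall

let replace
    : Text → Text → Text → Text
    = λ(needle : Text) →
      λ(replacement : Text) →
      λ(haystack : Text) →
        Prelude.Text.concatSep replacement (Text/split needle haystack)

in  replace "-" "_" "foo-bar-baz"  -- would evaluate to "foo_bar_baz"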

Another idea I had was that we could expose the code-points of a Text with functions

Text/toCodePoints : Text -> List Natural
Text/fromCodePoints : List Natural -> Optional Text

Any text manipulations could then be implemented in pure Dhall. They would probably be rather inefficient though, so maybe that’s not a good idea.
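For example, the original kebab-case-to-snake-case request could then be written in pure Dhall (a sketch using the two hypothetical built-ins above; 45 and 95 are the code points of '-' and '_'):

let Prelude = https://prelude.dhall-lang.org/package.dhall

let kebabToSnake
    : Text → Optional Text
    = λ(t : Text) →
        Text/fromCodePoints
          ( Prelude.List.map
              Natural
              Natural
              (λ(c : Natural) → if Prelude.Natural.equal c 45 then 95 else c)
              (Text/toCodePoints t)
          )

in  kebabToSnake "foo-bar-baz"  -- would evaluate to Some "foo_bar_baz"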

The Text/toCodePoints built-in would also enable a Text/equal implementation. I think that would be a useful function, but it would contradict https://docs.dhall-lang.org/discussions/Design-choices.html#text-manipulation

This article makes the case that we should be operating in terms of grapheme clusters instead of code points:

… so the hierarchy really should be:

Text ↔ List GraphemeCluster

GraphemeCluster ↔ List Natural -- Code points

… but I think it would be more ergonomic if we were to inline all of the intermediate types to get:

Text/unpack : Text → List (List Natural)

Text/pack : List (List Natural) → Text

I believe those two primitives (plus a Text/normalize : Text → Text) would essentially permit any operation that we desire (albeit inefficiently). However, I think the real benefit of having those two primitives is that we could standardize the behavior of more efficient primitives (e.g. fast truncate or fast replace) in terms of how they would be implemented using those inefficient primitives.
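For example, a fast truncate could be specified like this (a sketch in terms of the hypothetical Text/unpack and Text/pack above, plus the Prelude's take):

let Prelude = https://prelude.dhall-lang.org/package.dhall

let truncate
    : Natural → Text → Text
    = λ(n : Natural) →
      λ(t : Text) →
        Text/pack (Prelude.List.take n (List Natural) (Text/unpack t))

in  truncate 3 "héllo"  -- would keep the first three grapheme clusters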


I process text for a living.

The main thing to realize about text processing is that the only reason ever to decompose text into characters is if your programming language forces you to jump through that hoop. If you look at actual languages designed for text processing (Icon, SNOBOL, OmniMark, …) you’ll find they just treat text as a primitive type. It’s general-purpose languages like C that introduced the idea that text should be treated as an array of characters. Their goal was to be minimalistic and to avoid introducing another primitive, but the consequence is that a generation or two of programmers was raised on the idea that the only way to manipulate text is to first decompose it into characters.

There are no characters. There’s only text, and the smallest part of text is still text. The only time you need to know about Unicode code points that make up the text is if you’re doing a hex dump or encoding the text into a binary form like UTF-8. A Text->Text function should never need to go through Natural.

Sorry about the diatribe. To make it constructive, the primitives you may want in the API are

Text/++ : Text → Text → Text
Text/empty : Text
Text/equal : Text → Text → Bool
Text/lexicographicalOrder : Text → Text → Ordering
Text/split : Text → Text → List Text
Text/graphemeClusters : Text → List Text
Text/upperCase : Text → Text
Text/lowerCase : Text → Text
Text/capitalCase : Text → Text
Text/toCodePoints : Text → List Natural
Text/fromCodePoints : List Natural → Text

Wouldn’t some Naturals represent invalid code points?! In that case the types would be

Text/pack : List (List Natural) → Optional Text

or

Text/fromCodePoints : List Natural → Optional Text

…depending on what we decide to implement.

@sjakobi: Yeah, that’s a good point

So assuming that we agree with @blamario that we should focus on high-level primitives (i.e. ones that are not based on individual characters), there are basically three possible options to choose from:

  • Make Text “transparent”

    e.g. add a Text/split : Text → Text → List Text primitive and then we could implement the original request as:

    Prelude.Text.concatSep "_" (Text/split "-" name)
    
  • Add Text transformations that don’t enable introspection

    i.e. instead of Text/split we add built-ins like Text/replace : Text → Text → Text → Text, which does not permit introspection (see the sketch after this list)

    Some other hypothetical primitives that might fall into this category are:

    • Text/take : Natural → Text → Text
    • Text/escapeXML : Text → Text
  • Continue to keep Text opaque
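For example, the second option would already cover the original request (a sketch; Text/replace here is the proposed built-in, not something the language has at this point):

let name = "foo-bar-baz"

in  "org/${name}:\${${Text/replace "-" "_" name}_TAG:-latest}"

-- would evaluate to "org/foo-bar-baz:${foo_bar_baz_TAG:-latest}"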


The best choice depends on the ultimate use cases. That being said, the first option does not really make Text transparent or enable introspection in the same way that Haskell’s accursed type String = [Char] does. I can’t think of any serious harm it could cause in the long term. It could perhaps become obsoleted by more primitive functions in future, but that doesn’t seem like a heavy burden. If nobody else can think of a more serious problem, that would be my default choice.

Again it depends on the desired end goal, but one possible design is to gradually add primitives that both

  • keep Text opaque but decompose it into smaller Text values, and
  • can be used to eventually build a parser combinator library.

The first property ensures that no door to an alternative approach is closed, the second that every possible use case is eventually covered. The only primitives you really need to build a decently-performing, general-purpose combinator library are

  • stripPrefix :: Text -> Text -> Maybe Text
  • splitAt :: Natural -> Text -> (Text, Text)
  • span :: (Text -> Bool) -> Text -> (Text, Text)
  • A family of Text -> Bool functions, such as startsWithLetter, startsWithNumber, etc.

The main downsides stem from Dhall not being Haskell. Without user-defined operators and do-notation the parsers would not look as nice. I’m not sure if the type system is up to the task either. And finally, while the result would be completely general-purpose and pretty fast, for any specific use case it would still be slower than a dedicated function like split.
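To give a flavour of what such a combinator library could look like (a rough sketch, assuming only a hypothetical Text/stripPrefix : Text → Text → Optional Text built-in):

let Parser
    : Type → Type
    = λ(a : Type) → Text → Optional { value : a, rest : Text }

let literal
    : Text → Parser Text
    = λ(expected : Text) →
      λ(input : Text) →
        merge
          { None = None { value : Text, rest : Text }
          , Some = λ(rest : Text) → Some { value = expected, rest = rest }
          }
          (Text/stripPrefix expected input)

in  literal "foo-" "foo-bar"  -- would evaluate to Some { value = "foo-", rest = "bar" }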


An important question (which I think you were hinting at, @Gabriel439) is whether we want to support text equality or even substring checking (like Text/hasSubString : Text -> Text -> Bool).

I think that once we have one of these features (enabled for example via Text/split : Text -> Text -> List Text), users may use that instead of properly modelling their domain with unions, thereby reducing clarity and ultimately maintainability of their configurations.

So I think we ought to be cautious about allowing this kind of Text introspection. From this perspective it would be safer to go with "non-introspective" operations like Text/replace or Text/take.

(These considerations are obviously inspired by one of your blogposts: http://www.haskellforall.com/2016/04/worst-practices-should-be-hard.html)

@sjakobi: Yeah, that’s where I was going with that distinction about introspection. As a simple example, if we provided a Text/split built-in then users might choose to model lists as comma-separated unquoted values. In other words instead of this:

[ "foo", "bar", "baz" ]

… they might try to create a Text DSL where they ask the user to instead supply:

"foo, bar, baz"

… and that DSL would be vulnerable to mistakes like elements containing commas in their names.