Figuring out performance bottlenecks

We have an internal dhall library, which is pretty new and not that big (~1400 lines, not that heavily commented so that’s mostly code). It’s a “boilerplate” library, where you import it and (using dhall to-directory-tree) you can use a bunch of types and functions to generate various common CI files.

Over time it feels like it’s gotten slower than it should be, so I was hoping to dig in and see if there was an obvious culprit. For reference, it’s taking ~5s simply to load + typecheck, not counting any actual work. I’m testing this by re-exporting the Prelude and running:

time dhall <<< '(./package.dhall).Prelude.Bool.not True'
False
real 0m5.140s
user 0m4.748s

The only external dependency is the prelude, which is frozen, so nothing’s hitting the network. And if I import the prelude directly, it only takes a fraction of that time:

time dhall <<< '(./dependencies/prelude.dhall).Bool.not True'
False
real 0m0.144s
user 0m0.119s

At a gut level, this feels wrong. We haven’t written that much code, and the prelude is something like 3-4x the number of lines. Admittedly ours is probably denser (fewer comments, more complex types) than the Prelude, but it still feels excessive.

I’ve tried a few things, following hunches:

  • frozen all imports in all dhall files: no noticeable difference
  • tried to surgically comment out imports / functions

I’ve seen some effect getting the evaluation time down by commenting out various parts, but it’s very hard for me to tell whether that’s because I’ve found a suspiciously expensive branch, or because that code just happens to bring in a bunch of transitive expressions that you’d expect to be expensive anyway.

So, any suggestions as to how to continue? I’ve been trying to find a reproduction case I can share, but no real luck so far.

One remaining hunch I don’t know how to check is my suspicion around the JSON type. Since we’re using dhall to-directory-tree we’ve got a lot of functions mapping custom types into JSON.Type, and perhaps, given the way that type encodes its recursion, it’s having an outsize effect on type checking?
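
For reference, the Prelude’s JSON.Type is a Boehm-Berarducci encoding of the recursive JSON tree, so every expression mentioning it carries the whole type around. Roughly (abbreviated, not the verbatim Prelude definition):

let JSON/Type
    : Type
    =     forall (JSON : Type)
      ->  forall  ( json
                  : { array : List JSON -> JSON
                    , null : JSON
                    , object : List { mapKey : Text, mapValue : JSON } -> JSON
                    , string : Text -> JSON
                    -- ...plus handlers for booleans and numbers
                    }
                  )
      ->  JSON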

1 Like

Interesting!

Could you time dhall type --quiet on your package.dhall?

A profiled run would also be very interesting. I believe the Nix tooling also produces profiled binaries (or at least libraries), but I’m not sure how to access them…

1 Like

Similar time for just type --quiet. Interestingly, I noticed there’s a whole extra second taken when I interpret “an expression referencing the file” vs passing in the --file directly. That’s surprising to me…

time dhall type --quiet <<< ./package.dhall

real	0m4.804s
user	0m4.480s
time dhall type --quiet --file ./package.dhall

real	0m3.849s
user	0m3.561s

(I tried each variant 3 times, timings are consistent)

So there seems to be a general “wrapping” tax somewhere:

$ echo './package.dhall' > package2.dhall
$ time dhall type --quiet --file ./package2.dhall

real	0m4.807s
user	0m4.458s

$ echo './package2.dhall' > package3.dhall
$ time dhall type --quiet --file ./package3.dhall

real	0m5.793s
user	0m5.414s

1 Like

Do you already use dhall freeze --cache --all like the Prelude does?
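
For reference, the --cache flag rewrites each import into the cache-first fallback idiom, so the interpreter consults the content-addressable cache before touching the import itself. The result looks along these lines (hash made up for illustration):

  missing sha256:0000000000000000000000000000000000000000000000000000000000000000
? ./dependencies/prelude.dhall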

I can provide some context that might help.

First off, local imports (even ones without integrity checks) are implicitly cached by the Haskell implementation. This is a non-standard feature specific to the Haskell implementation, but it has no effect on the observable behavior; it only affects performance. This implicit cache is stored underneath ~/.cache/dhall-haskell.
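
If you want to compare against a cold cache while measuring, you can just clear that directory between runs:

$ rm -r ~/.cache/dhall-haskell   # forces the next run to resolve imports from source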

This is likely why trivially wrapping the expression in another file that imports it changes performance: the wrapper file triggers this implicit cache.

This is also why I don’t believe dhall freeze --cache --all will improve things: the Haskell implementation is already implicitly caching local imports, so it would likely not make a difference.

Usually when something like this happens it is because the interpreter is decoding a large normal form. One thing you should do is check how large the normal form is, like this:

$ dhall --file "${FILE}" | dhall encode | wc -c

For example, the Prelude’s normal form is ~182 KB when stored in the cache

$ dhall <<< 'https://prelude.dhall-lang.org/v16.0.0/package.dhall' | dhall encode | wc -c
  181982

The reason this matters is that the Haskell implementation appears to be slow when decoding CBOR (~5-10 MB/s decoding speed), so I suspect that your normal form is tens of megabytes in size and it’s taking the interpreter a few seconds to decode all of that. For more context, see:

2 Likes

$ dhall --file "${FILE}" | dhall encode | wc -c
4727750

So, 25x the size of the prelude, apparently :grimacing:
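
Back-of-the-envelope: 4,727,750 bytes at the ~5-10 MB/s decoding speed mentioned above is roughly 0.5-1s per decode, which would also fit the earlier observation that each extra layer of wrapping (each layer getting its own implicitly cached copy of essentially the same normal form) added about a second.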

Is there likely to be anything I can do to get this normal form down? Or is it strongly tied to the complexity / number of things exported?

@timbertson: Probably the easiest thing to do is to interpret the file and visually inspect the normal form and see if anything stands out.

Other than that, the main solution here is to improve decoding (5-10 MB/s for decoding CBOR was never fast enough in my opinion)

1 Like

It’s difficult to advise you without looking at your specific configuration, but one thing that made massive improvements in the size of our normal forms was to reduce our use of deeply nested unions. Remember that every union literal needs to include the types of all of the alternatives it isn’t using at the time. If the types of those alternatives include unions themselves, and those unions include alternatives which are themselves unions, and so forth, then the top-level union literal ends up spelling out every possible combination of the deeply nested union types.

We had a case where the unions were six levels deep… the size of the normal form suffered a combinatorial explosion.
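
A contrived sketch of the shape (made-up types, not our actual configuration):

let Leaf = < A | B | C >

let Mid = < L : Leaf | Ls : List Leaf >

let Top = < M : Mid | Ms : List Mid >

-- Top.M's normal form spells out both of Mid's alternatives, and each of
-- those spells out all of Leaf's alternatives, so every extra level of
-- nesting multiplies the size of every literal.
in  Top.M (Mid.L Leaf.A)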

Thanks for the context, folks. It’s obviously hard to eyeball the whole thing, but I did do it on some submodules and peeked at the most surprisingly large ones.

So I guess it’s not type checking after all, but this does seem to be a likely culprit. Basically, if I jump to random locations in the normalized form of a heavy module, most of the time the screen is full of the JSON type.

As an illustrative example, I have a simple JSON helper:

let mapListToNull =
    -- map a List T into a JSON.array (or JSON.null if empty)
          \(T : Type)
      ->  \(fn : T -> JSON.Type)
      ->  \(values : List T)
      ->  nullIfEmpty (mapList T fn values)

Obviously this will include the definitions of nullIfEmpty and mapList, which you can probably guess at, but that comes out to an encoded form of over 6 KB. Meanwhile, a whole module full of bash-script-related types and templates is only 2 KB.
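
For completeness, the helpers are essentially this (reconstructed for illustration, not our exact code):

let mapList =
          \(T : Type)
      ->  \(fn : T -> JSON.Type)
      ->  Prelude.List.map T JSON.Type fn

let nullIfEmpty =
          \(values : List JSON.Type)
      ->  if    Prelude.List.null JSON.Type values
          then  JSON.null
          else  JSON.array values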

The JSON type is pervasive elsewhere, too - due to the use of dhall to-directory-tree, basically every type has a toJSON companion function, and the repetitions of the JSON type take up the vast majority of the normalized form.
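
To illustrate the pattern (a made-up type, not one from our library):

let Step = { name : Text, run : Text }

let Step/toJSON =
          \(step : Step)
      ->  JSON.object
            (toMap { name = JSON.string step.name, run = JSON.string step.run })

The normal form of every function like this repeats the full JSON type for each handler it touches.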

@timbertson: The proliferation of */toJSON functions can be mitigated by standardizing this:

This is something that will probably keep coming up because normal forms duplicate things a lot. Would it make sense to imagine a binary encoding that includes hash-consing? That would take care of all of this redundancy in one go. But that’s of course quite complex to specify.

2 Likes

Yes, it feels like there are three separate issues here:

  1. normal forms have a lot of redundancy
  2. CBOR preserves certain kinds of strings (which is problematic for things like JSON/Type)
  3. dhall-haskell’s CBOR decoding is slow

I think that we need to fix both 1 and 3 (with 2 a nice-to-have but lower priority), in order to properly address the performance issues with fetching from cache.

2 Likes

@philandstuff: What CBOR decoding performance do you get for the Go implementation (i.e. in MB/s)?

I’ve just thrown together some naive-ish benchmarks and I get ~10 MB/s (on the Prelude) and ~18 MB/s (on tests/parser/success/largeExpression.dhallb).

I haven’t put any effort into optimizing (or even measuring) this before, so no doubt there’s room for improvement.

1 Like

For reference: I’ve opened https://github.com/dhall-lang/dhall-haskell/issues/1804 to track the decoding performance of the Haskell implementation.

2 Likes

@timbertson Do you happen to use Prelude.JSON.render a lot? Its normalized form, and therefore its binary encoding, is absolutely massive:

$ echo 'https://prelude.dhall-lang.org/v16.0.0/JSON/render' | dhall --alpha | dhall encode | wc -c
148899

It’s 82% of the whole encoded Prelude!

$ echo 'https://prelude.dhall-lang.org/v16.0.0/package.dhall' | dhall --alpha | dhall encode | wc -c
181982

Curiously, renderYAML is much smaller:

$ echo 'https://prelude.dhall-lang.org/v16.0.0/JSON/renderYAML' | dhall --alpha | dhall encode | wc -c
12511

Maybe we can figure out what makes render so huge and try to get rid of that for some easy wins…

I already managed to shave 37% off the encoded Prelude size with a small simplification in Prelude.JSON.renderAs: https://github.com/dhall-lang/dhall-lang/pull/1015

The irony of course is that it was me who made JSON.render a bunch bigger (with no idea it’d have this effect) in https://github.com/dhall-lang/dhall-lang/pull/936 :man_facepalming:

I wouldn’t have expected it to make that big a difference; I’m using it in only a handful of places (at the edges of each file format), meanwhile the (smaller) JSON helpers are used all over the place.

But I ran your commit through the top-level import and it went from ~7s -> ~4.5s, so that’s quite an improvement, thank you!

Yeah, I was surprised too by how much cache sizes are affected by how often an argument appears in a function body. Although it makes perfect sense when you think about how normalization works. This reminds me of the linear types proposal for GHC, which seems like a really curious connection.

If you normalize your mapListToNull function you’ll probably also find that the values argument is used at least twice: once for a List/null check, once to produce the output JSON.array.
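
Schematically, assuming the obvious definition of nullIfEmpty and inlining it, the normalized body looks something like:

if Natural/isZero (List/length JSON.Type (mapList T fn values))
then JSON.null
else JSON.array (mapList T fn values)

i.e. the whole mapList T fn values subexpression, types and all, appears twice.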

You can make that function “linear” in multiple ways: One way would be to replace the List/null check with List/unsnoc, somewhat like this:

merge
  { None = JSON.null
  , Some = \(ny : NonEmpty JSON.Type) -> JSON.array (ny.init # [ ny.last ])
  }
  (List/unsnoc JSON.Type (mapList T fn values))

Or you could try fusing List/unsnoc with mapList directly…

None of this is very obvious though, and it doesn’t generally improve the readability of the code, so I don’t think this should become standard practice.

I also don’t believe that you can get around duplicating type arguments in this way, so those will still cause a lot of bloat in the caches.

I think Gabriel has mentioned a possible new caching mechanism that doesn’t normalize expressions. That seems like a promising idea to me.

I’m also curious about @philandstuff’s idea to make normal forms less redundant. What did you have in mind there, @philandstuff?

As for redundancy, one could use compression. That would solve some but not all problems.
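
E.g. a quick way to estimate how much of the encoding is redundancy, using nothing Dhall-specific:

$ dhall --file ./package.dhall | dhall encode | gzip -c | wc -c

If that comes out at a small fraction of the uncompressed 4.7 MB, most of the decoding work is being spent on duplicated subexpressions.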