I’m currently working on a
dhall-haskell branch in relation to my previous post here. Having looked at the implementation of importing, caching and hashing, I see a number of performance problems. I try to summarize them here and outline some solutions, which may or may not require standard changes.
One thing which puts a hard cap on scaling is storage of normal forms of programs and any sort of nontrivial computation on them. Type checking necessarily involves normalization of expressions which are types or occur in types, but in practice these are bounded. Programs on the other hand, are not practically size-bounded, and their normal forms are just massive. For this reason, proof assistants which support term normalization explicitly avoid to store or process normal forms.
The current standard includes storing normal forms in local caches of hashed imports, and also requires computing hashes on normal forms. Currently, this works reasonably on Prelude files, but fails for larger developments.
Normalization is not always avoidable: in the case of
dhall-to-json, we want to eliminate language constructs that are not supported in JSON, and the simplest way to do this is normalization. This seems OK to me because JSON is an extremely simple language and normal forms are what we actually want in this case. In contrast, Nix as target language supports essentially everything in Dhall, so it makes sense to avoid normalization when compiling to Nix (we should just gather up resolved imports without unfolding anything, but this is another topic).
So, normalizing a particular expression should be avoided if possible, but note that the current standard requires normalizing imported expressions potentially many times. Consider a linear chain of hashed imports, with each imported expression also used linearly. Then we need to hash and cache all the normal forms, but they transitively contain the other imported normal forms. So we store and hash each normal form a quadratic number of times. I don’t think the standard requires this, but the Haskell implementation aggravates the situation by re-type-checking the normal form of every imported expression, even if the original imported expression is already known to be well-typed.
First, I keep one current feature fixed, which is that every imported expression is type checked. This is practically feasible for large developments (if we have a fast implementation), and IMO this provides the bulk of security guarantees.
Simply have a faster standard-compliant implementation, and assume that in Dhall’s domain, expressions will not get too large. This could be fine for all existing projects, but I don’t have much faith in the assumption that expression will not get too large.
Hash the expressions themselves, not their normal forms. Propagate hashes through imports by hashing recursively. This is fast and works fine for detecting changed expressions behind imports. It does not a) detect changed implementation of normalization b) allow changed expressions which still have the same normal forms. For a), I don’t think it’s very interesting, because we use our local normalizer when we use imports, and the remote normalizer is not relevant as far as the imported expression is unchanged and well-typed, for b), this is also not very useful, because alpha-beta conversion (notion of conversion decidable by normalization) captures only a tiny part of the relevant notion of program equivalence. For example, if I replace a function in my library with a faster but equivalent one, almost always it’ll have a different normal form.
Don’t hash or cache anything. This is simple and fast, but I think Option 2 is probably preferable. In an otherwise efficient implementation, with caching implemented via Option 2, caching basically only saves us time on downloading imports. So I note that caching is not likely to provide a dramatic performance boost, unless you have awful Internet connection.
Option 3 is currently implemented in the prototype, which can be used for benchmarking but lacks many things (inluding good error messages and every dhall command besides “dhall type” and “dhall”).