Issues with local caching and hashes of normal forms


(Andras Kovacs) #1

I’m currently working on a dhall-haskell branch in relation to my previous post here. Having looked at the implementation of importing, caching, and hashing, I see a number of performance problems. I’ll summarize them here and outline some solutions, which may or may not require changes to the standard.

Performance issues

One thing which puts a hard cap on scaling is the storage of normal forms of programs, along with any sort of nontrivial computation on them. Type checking necessarily involves normalizing expressions which are types or occur in types, but in practice these are bounded. Programs, on the other hand, are not practically size-bounded, and their normal forms are just massive. For this reason, proof assistants which support term normalization explicitly avoid storing or processing normal forms.

The current standard includes storing normal forms in local caches of hashed imports, and also requires computing hashes of normal forms. Currently, this works reasonably well for the Prelude, but fails for larger developments.

Normalization is not always avoidable: in the case of dhall-to-json, we want to eliminate language constructs that are not supported in JSON, and the simplest way to do this is normalization. This seems OK to me, because JSON is an extremely simple language and normal forms are what we actually want in this case. In contrast, Nix as a target language supports essentially everything in Dhall, so it makes sense to avoid normalization when compiling to Nix (we should just gather up resolved imports without unfolding anything, but this is another topic).

So, normalizing a particular expression should be avoided if possible, but note that the current standard requires normalizing imported expressions potentially many times. Consider a linear chain of hashed imports, with each imported expression also used linearly. Then we need to hash and cache all the normal forms, but each one transitively contains the normal forms of everything below it in the chain, so the total amount of data stored and hashed grows quadratically in the length of the chain. I don’t think the standard requires this, but the Haskell implementation aggravates the situation by re-type-checking the normal form of every imported expression, even if the original imported expression is already known to be well-typed.
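To make this concrete, here is a toy cost model (plain Haskell; none of these names come from dhall-haskell) for a linear chain of n hashed imports, where import k's cached normal form transitively inlines the normal forms of imports 1..k, each contributing roughly s bytes of its own:

```haskell
-- Toy cost model, assuming a chain of n hashed imports where import k's
-- cached normal form inlines the normal forms of imports 1..k, each of
-- which contributes ~s bytes. All names here are illustrative.

chainCost :: Int -> Int -> Int
chainCost n s = sum [k * s | k <- [1 .. n]]  -- = s * n * (n + 1) `div` 2

main :: IO ()
main = mapM_ (\n -> print (n, chainCost n 1000)) [10, 100, 1000]
-- (10,55000)
-- (100,5050000)
-- (1000,500500000)  -- ~0.5 GB hashed and cached for a thousand 1 kB imports
```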

Potential solutions

First, I keep one current feature fixed, which is that every imported expression is type-checked. This is practically feasible for large developments (if we have a fast implementation), and IMO this provides the bulk of the security guarantees.

  1. Simply have a faster standard-compliant implementation, and assume that in Dhall’s domain, expressions will not get too large. This could be fine for all existing projects, but I don’t have much faith in the assumption that expressions will not get too large.

  2. Hash the expressions themselves, not their normal forms. Propagate hashes through imports by hashing recursively (a sketch of this scheme follows the list below). This is fast and works fine for detecting changed expressions behind imports. It does not a) detect a changed implementation of normalization, or b) tolerate changed expressions which still have the same normal form. As for a), I don’t think it’s very interesting, because we use our local normalizer when we use imports, and the remote normalizer is not relevant as long as the imported expression is unchanged and well-typed. As for b), this is also not very useful, because alpha-beta conversion (the notion of conversion decidable by normalization) captures only a tiny part of the relevant notion of program equivalence. For example, if I replace a function in my library with a faster but equivalent one, it will almost always have a different normal form.

  3. Don’t hash or cache anything. This is simple and fast, but I think Option 2 is probably preferable. In an otherwise efficient implementation, with caching implemented via Option 2, caching basically only saves us the time spent downloading imports. So I note that caching is not likely to provide a dramatic performance boost, unless you have an awful Internet connection.
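Here is a minimal illustration of the recursive hashing in Option 2 (the AST, the encoding, and all names are invented for illustration; a real implementation would presumably hash dhall-haskell’s Expr via the standard’s CBOR encoding, which is injective, rather than this ad-hoc string format). The idea is to hash the source expression itself, splicing in the already-computed hash of each import instead of unfolding its body, so each expression is hashed exactly once, in time proportional to its own size:

```haskell
import Crypto.Hash (Digest, SHA256, hash)  -- from the cryptonite package
import qualified Data.ByteString.Char8 as BS

data Expr
  = Var String
  | Lam String Expr
  | App Expr Expr
  | Import (Digest SHA256)  -- an import node carries only its resolved hash

-- Serialize without resolving imports: an import contributes just its hash.
encode :: Expr -> BS.ByteString
encode (Var x)    = BS.pack ("v:" ++ x)
encode (Lam x b)  = BS.concat [BS.pack ("\\" ++ x ++ " -> "), encode b]
encode (App f a)  = BS.concat [BS.pack "(", encode f, BS.pack " ", encode a, BS.pack ")"]
encode (Import h) = BS.pack ("import:" ++ show h)

exprHash :: Expr -> Digest SHA256
exprHash = hash . encode
```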

Option 3 is currently implemented in the prototype, which can be used for benchmarking but lacks many things (including good error messages and every dhall command besides “dhall type” and “dhall”).


(Gabriel Gonzalez) #2

In contrast, Nix as a target language supports essentially everything in Dhall, so it makes sense to avoid normalization when compiling to Nix (we should just gather up resolved imports without unfolding anything, but this is another topic).

Yeah, I think it’s fine to compile Dhall expressions directly to Nix without normalization if that is faster. If we can also translate Dhall’s import system to an equivalent Nix derivation, that would be even better, because then not even import resolution needs to be done at compilation time.

However, I think the most promising line of work here is @ocharles’s dhallix project to replace Nix entirely with Dhall (i.e. using Dhall to directly generate Nix derivations). The main blocker to this was the performance bottlenecks which you are fixing.

So, normalizing a particular expression should be avoided if possible, but note that the current standard requires normalizing imported expressions potentially many times. Consider a linear chain of hashed imports, with each imported expression also used linearly. Then we need to hash and cache all the normal forms, but each one transitively contains the normal forms of everything below it in the chain, so the total amount of data stored and hashed grows quadratically in the length of the chain.

This seems like something that could be fixed and standardized. If the top-level import is protected by a semantic integrity check then there’s no need to verify transitive semantic integrity checks.
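A sketch of what that could look like (every type and name below is a hypothetical stand-in, not dhall-haskell’s actual API): an import protected by an integrity check that is satisfied from the local cache is returned immediately, so transitive checks are verified at most once, when the cache entry is first written.

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

type Hash  = String
type Expr  = String                -- stand-in for a resolved expression
type Cache = IORef (Map.Map Hash Expr)

data Import = Import
  { importBody   :: Expr           -- what fetching this import yields
  , expectedHash :: Maybe Hash     -- its semantic integrity check, if any
  , subImports   :: [Import]
  }

toyHash :: Expr -> Hash            -- stand-in for SHA-256 of the normal form
toyHash e = "h" ++ show (length e)

resolve :: Cache -> Import -> IO Expr
resolve cache imp = case expectedHash imp of
  Nothing -> resolveFresh cache imp
  Just h  -> do
    hit <- Map.lookup h <$> readIORef cache
    case hit of
      Just e  -> pure e            -- cache hit: stop; no transitive checks
      Nothing -> do
        e <- resolveFresh cache imp
        if toyHash e == h          -- verify the integrity check once
          then modifyIORef' cache (Map.insert h e) >> pure e
          else fail "integrity check failed"

resolveFresh :: Cache -> Import -> IO Expr
resolveFresh cache imp = do
  subs <- mapM (resolve cache) (subImports imp)  -- recurse into sub-imports
  pure (concat subs ++ importBody imp)           -- toy "substitution"
```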

I don’t think the standard requires this, but the Haskell implementation aggravates the situation by re-type-checking the normal form of every imported expression, even if the original imported expression is already known to be well-typed.

Yeah, this behavior is not required by the standard at all. The standard only requires type-checking imported expressions (mainly to enforce that they have no free variables that might be accidentally captured by the surrounding expression).
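As a small illustration of the hazard being guarded against (a toy AST, not the real type checker): an imported expression with a free variable could have that variable captured by a binder in the importing expression, so requiring imports to type-check in an empty context forces them to be closed.

```haskell
import Data.List (nub, (\\))

data Expr = Var String | Lam String Expr | App Expr Expr

freeVars :: Expr -> [String]
freeVars (Var x)   = [x]
freeVars (Lam x b) = freeVars b \\ [x]
freeVars (App f a) = nub (freeVars f ++ freeVars a)

-- An import is safe to splice into its surroundings only when it is closed;
-- e.g. splicing `Var "x"` under an importer's `Lam "x" _` would silently
-- capture the importer's `x`.
safeToSplice :: Expr -> Bool
safeToSplice = null . freeVars
```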

The only reason the Haskell implementation normalizes imported expressions is that, at the time, it was an optimization, but if it’s slowing things down in light of your new changes, please feel free to disable that behavior! 🙂

First, I keep one current feature fixed, which is that every imported expression is type-checked

Yes, I agree that this is the only key feature that we need to preserve, mainly to avoid variable capture.

Hash the expressions themselves, not their normal forms.

I don’t think we should go that route, but I’d be willing to change my mind. One of the reasons we hash normal forms is so that normal-form-preserving refactorings (like adding/removing let expressions or renaming variables) don’t require regenerating semantic integrity checks. As you noted, it’s not perfect, but it’s better than nothing.

The reason I’m willing to change my mind is that the primary requirement behind semantic integrity checks is to prevent malicious tampering with imported expressions, so normalizing expressions before hashing them is not strictly required for that security guarantee. However, if you don’t αβ-normalize before hashing, you will perturb the hash very frequently, and every time the hash changes the security-conscious user has to re-audit the normalized expression to verify that they still approve of it before freezing it again. If this process has too many false positives, users will stop auditing expressions, which defeats the security guarantee.
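For concreteness, here is a self-contained toy (a tiny expression language, with `show` standing in for Dhall’s canonical CBOR encoding; none of this is dhall-haskell’s real code) showing how hashing the normal form survives exactly these refactorings:

```haskell
import Crypto.Hash (Digest, SHA256, hash)  -- from the cryptonite package
import qualified Data.ByteString.Char8 as BS

data Expr
  = BoolLit Bool
  | Var String
  | Let String Expr Expr   -- let x = e in body (non-recursive)
  | And Expr Expr
  deriving Show

-- Toy normalizer: inline lets and fold And. Dhall's real βα-normalization
-- is far richer, but the principle is the same.
normalize :: Expr -> Expr
normalize (Let x e b) = normalize (subst x (normalize e) b)
normalize (And l r)   = case (normalize l, normalize r) of
  (BoolLit p, BoolLit q) -> BoolLit (p && q)
  (l', r')               -> And l' r'
normalize e           = e

subst :: String -> Expr -> Expr -> Expr
subst x v (Var y) | x == y = v
subst x v (Let y e b)
  | x == y    = Let y (subst x v e) b   -- the inner binding shadows x
  | otherwise = Let y (subst x v e) (subst x v b)
subst x v (And l r) = And (subst x v l) (subst x v r)
subst _ _ e         = e

-- Hash the normal form; `show` stands in for a canonical binary encoding.
semanticHash :: Expr -> Digest SHA256
semanticHash = hash . BS.pack . show . normalize

-- Both expressions normalize to `BoolLit True`, so adding the `let` (or
-- renaming its bound variable) does not change the semantic hash:
main :: IO ()
main = do
  print (semanticHash (Let "x" (BoolLit True) (And (Var "x") (BoolLit True))))
  print (semanticHash (BoolLit True))
```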

Don’t hash or cache anything

I would also prefer not to do that, mainly because the primary purpose of semantic integrity checks is preventing malicious imports. I want to stress that caching expressions is only a secondary benefit of the semantic integrity check, one I would be willing to give up if it were actually harming performance. The primary reason for adding semantic integrity checks is securing imported expressions, and I’m fine taking an (opt-in) performance hit if I want to protect a remote import.


(Andras Kovacs) #3

Is the usage pattern where expressions change but normal forms don’t actually common in the wild? I don’t have any such usage data myself. I would think that for something like the Prelude, which mostly contains functions as opposed to simple definitions, changing expressions likely changes normal forms as well, but there could be other cases where people do lots of refactoring of imports and definitions without changing normal forms.

I think this comes down to which deters people more from using integrity checks: the performance cost of full normalization or the fragility of non-normal expression hashes. My intuition now is that the performance cost would be a bigger issue.

Alternatively, Dhall could support both kinds of hashes. This would also allow us to eventually get data about their relative popularity in practice.
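Purely as a hypothetical sketch (these constructor names, and the idea of tagging the digest, are not part of the standard today), supporting both might look like distinguishing the two kinds of integrity check in the syntax tree:

```haskell
-- Hypothetical: a tagged integrity check distinguishing what the digest
-- covers. Today's standard only has the first kind (`sha256:...` over the
-- CBOR encoding of the αβ-normal form).

type Hash = String  -- stand-in for a hex-encoded SHA-256 digest

data IntegrityCheck
  = SemanticHash  Hash  -- hash of the αβ-normal form (current behavior)
  | SyntacticHash Hash  -- hash of the source, with import hashes spliced in
```

This would let users choose per import between refactoring-stability and hashing speed, and the resulting usage data could inform a later standard decision.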