Why does term order not impact semantic hashes?

I found the concept of these semantic hashes interesting. I was wondering how the normalization ensures that these are always the same, even if I change the order of elements within the dhall file.

I wanted to use semantic hashes in a project of mine where I manipulate data. It would be a nice feature to ensure that refactoring my DSL keeps the semantics the same.

My DSL can be used to do calculations on data. Let’s say I read a CSV and average column col1. This gives hash A. But if I rename column col1 to col2 and then average col2, the same hash should be generated.

Or, given that every row in a dataset has the same number of entries, then the average of each row summed is the same as the sum of each row and then that number is averaged.

Edit: I just noticed something else. Even if Dhall creates some object with certain keys, the binary serialization needs to serialize each field in the same order, always, to ensure that the same hash is generated. Is that assumption true?

Basically the way that semantic hashing works is that you take an expression and apply the following steps:

  • β-normalize the expression

    This is what we normally think of as “evaluation”, but it also includes sorting the fields of records.

  • α-normalize the expression

    This basically renames all variables to _ and uses de Bruijn indices to distinguish them.

  • Convert the expression to a standard CBOR encoding

  • Compute the SHA256 hash of the bytes encoding that expression

So if two different Dhall expressions evaluate to the same result then they’ll have the same semantic hash.

You can learn more by studying the language standard which has all the gory details. In particular, the part on computing the semantic integrity check is in the section on Import resolution and in particular it’s this bit:

e₁ ⇥ e₂
e₂ ↦ e₃
encode(e₃) = binary
sha256(binary) = byteHash

It’s also worth noting that the first three steps (β-normalization, α-normalization, and binary encoding) are the exact same three steps used for checking equivalence, so if two Dhall expressions are equivalent then they will have the same semantic hash.