CBOR format uses short bytestrings for builtins

vmchale · September 12, 2020, 9:20pm

Right now, the CBOR encoding uses strings for builtins. I assume decoding performance would be better if integers were assigned to these builtins instead; each would take only a few bytes and be cheaper to decode.

This would mean another go at the standard, but it would fix a concrete problem.

I’m not sure if this has been discussed; I looked a bit in the issue tracker for the Dhall standard and here.
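
To make the size difference concrete, here is a rough sketch in Python (hand-rolled, standard library only) comparing a CBOR text string with a small unsigned integer. It assumes builtins are currently written as CBOR text strings (major type 3). The integer codes in HYPOTHETICAL_CODES are made up for illustration and are not part of the Dhall standard.

```python
import struct


def cbor_head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus its argument (RFC 8949, section 3)."""
    if value < 0:
        raise ValueError("argument must be non-negative")
    if value < 24:
        # Small arguments fit directly in the low 5 bits of the initial byte.
        return bytes([(major << 5) | value])
    if value < 2**8:
        return bytes([(major << 5) | 24, value])
    if value < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    if value < 2**64:
        return bytes([(major << 5) | 27]) + struct.pack(">Q", value)
    raise ValueError("argument too large for CBOR")


def encode_uint(n: int) -> bytes:
    """Major type 0: unsigned integer."""
    return cbor_head(0, n)


def encode_text(s: str) -> bytes:
    """Major type 3: UTF-8 text string, prefixed by its byte length."""
    data = s.encode("utf-8")
    return cbor_head(3, len(data)) + data


# Hypothetical numbering, invented for this sketch only.
HYPOTHETICAL_CODES = {"Text": 0, "Natural/fold": 1, "List/build": 2}

if __name__ == "__main__":
    for name, code in HYPOTHETICAL_CODES.items():
        as_text = encode_text(name)
        as_int = encode_uint(code)
        print(f"{name}: {len(as_text)} bytes as text, {len(as_int)} as integer")
```

Under these assumptions, "Natural/fold" takes 13 bytes as a text string (1 header byte plus 12 bytes of UTF-8), while any code below 24 fits in a single byte. A decoder would also skip the string comparison and dispatch on the integer directly, though whether that is measurable in practice would need benchmarking.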

My Experiment

I had a look at the occurrences of builtins in the serialized kubernetes example; there are quite a lot!

dhall decode --file benchmark/examples/kubernetes.dhall.bin | rg 'Text' -c

So there might be a performance improvement to be had.

Gabriel439 · September 13, 2020, 4:25pm

@vmchale: The approach I’m currently taking is to focus on allowing non-normalized cached imports. I think that will probably give the biggest wins in terms of encoding size. Part of the reason there are so many occurrences of various built-in expressions is that the expression is enormous in general and could be much more compact if it were not normalized.

vmchale · September 13, 2020, 6:34pm

Fair enough! I will await results