Seems to be a nice opportunity!
Based on timing the commands in the shell with `bench`, zstd looks good compared to lz4:
vanessa@vanessa-desktop /tmp 🌸 bench "dhall decode --file expr.dhalli" "lz4 -cd expr.dhalli.lz4 | dhall decode" "zstd -cd expr.dhalli.zst | dhall decode"
benchmarking bench/dhall decode --file expr.dhalli
time                 3.531 s    (3.458 s .. 3.674 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.441 s    (3.406 s .. 3.482 s)
std dev              44.79 ms   (18.55 ms .. 61.70 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking bench/lz4 -cd expr.dhalli.lz4 | dhall decode
time                 3.485 s    (3.477 s .. 3.494 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.430 s    (3.393 s .. 3.449 s)
std dev              34.97 ms   (4.647 ms .. 44.96 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking bench/zstd -cd expr.dhalli.zst | dhall decode
time                 3.410 s    (3.237 s .. 3.582 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 3.381 s    (3.356 s .. 3.419 s)
std dev              35.17 ms   (964.0 μs .. 43.54 ms)
variance introduced by outliers: 19% (moderately inflated)
…so at least in this case decompression isn't the bottleneck — piping through lz4 or zstd costs roughly nothing compared to the decode itself.