Working with binary form directly in expressions?

ari-becker · March 5, 2020, 8:58pm

I understand that Dhall has a binary encoding (CBOR) that is currently accessible with dhall encode and dhall decode, and because it’s a binary encoding it should hypothetically be more efficient than linted/formatted dhall.

Of course, running dhall <<< "$(dhall encode <<< '{ foo = "bar" }')" gives me a nice error about an invalid byte sequence, since the interpreter can’t work with the binary form directly.

I’m wondering if I’m missing something here? Is there some kind of ./path/to/binary-form.dhallb as CBOR that I’m missing? Is it really worth it to add an additional dhall decode pre-processing step to be able to use the binary form? Is this just immature/underdeveloped/underexplored?

Gabriel439 · March 6, 2020, 12:48am

@ari-becker: The closest thing we have is an import protected by a semantic integrity check if the binary representation is already in the cache. In fact, that’s how the Nixpkgs support for Dhall currently works.

We could add something like that, but I want to understand the use case a bit more first, mainly to see if it overlaps with the idiom I used for the Nixpkgs support.

ari-becker · March 6, 2020, 7:27am

@Gabriel439 basically, we want to ship functions. We have a common pattern of a typed super-configuration + a function that turns the super-configuration into the configuration for a specific tool e.g. Kubernetes manifests with dhall-kubernetes + a script that glues the super-configuration, the function, and the resulting configuration together with applying the configuration idempotently.

One way of shipping the functions is to put them on a server, serve them over the network, and use standard http://path/to/import sha256:some-hash imports to fetch and cache the import. And this could work… but we’ve found that it kind of sucks in our use-case. We apply our configuration in Concourse workers, and the caching mechanism in Concourse provides separate caches per containerized script (task) and per worker, which to be fair to Concourse fits Concourse’s vision fairly well. So on a brand-new worker, we may have to repopulate the entire cache from scratch an arbitrary number of times for a given number of tasks and pipelines. It’s not really acceptable for us from a performance standpoint, so one way we were thinking of solving this is to add various packages that we’re using into our build container and then use /usr/share/dhall/path/to/import ? http://path/to/import sha256:some-hash to drastically speed things up. And if we could get some kind of additional performance benefit from /usr/share/dhall/path/to/import.dhallb as CBOR then why not?

I saw the work you did for Nixpkgs, which would seem on its face to be a strictly better solution to the problem because it doesn’t require changing any of the Dhall code to use /usr/share/dhall/... ? conditional imports and pre-populating the cache means that no time is wasted populating the cache from disk. The issue I have with it is, that’s great if your build container target is NixOS, but currently our build container target is Fedora because we’re using buildah to build new container images; and getting buildah to install on anything other than Fedora is a pain right now. I guess we could install nix inside the Fedora target if we needed to but I’m still reluctant to throw our full weight behind building out Nix infrastructure internally (including Hydra etc.) because of the additional maintenance burdens and because we’re a 100% Kubernetes shop and getting Hydra running on Kubernetes is not exactly a well-trodden path. Importing from a standard-ish file path like /usr/share/dhall is simple and understandable, lets me keep the Concourse task cache directory (which is valuable when actually using the shipped configuration), and so far doesn’t require me to build out tooling whereby I examine /home/worker/.cache/dhall/* and write a function that every task must call in the beginning that determines whether or not it should be copied over into /tmp/build/<hash>/xdg-cache/dhall/, at which point any performance benefits achieved from pre-populating the cache are slim to none.

tristanC · March 6, 2020, 2:46pm

Why are you using an extra /tmp/build/<hash>/xdg-cache/dhall/ cache directory?

Not sure if this applies to your problem, but we are also using a Fedora-based container and, to speed up the initial import of external packages, we evaluate a dumb expression with dhall-to-json to populate the cache. FWIW, here is the Dockerfile.

Also note that the dhall tools are being packaged as RPMs, and perhaps we could also package dhall bindings so that you could dnf install dhall-kubernetes. On the other hand, I hope that the proposed proxy.dhall-lang.org service would make such pre-caching much easier too.

Gabriel439 · March 6, 2020, 4:40pm

@ari-becker: What I take from this is that we need to package Dhall for Fedora using the same approach we did for Nixpkgs.

ari-becker · March 6, 2020, 5:00pm

@tristanC the /tmp/build/<hash> directory is the directory which Concourse creates to set up the build environment. It’s not something that we have control over, or are meant to have control over; Concourse’s opinionated stance is that you’re given a current directory, everything that Concourse manages is put into that directory, so the location of that directory is unimportant and you should never refer to the build directory as an absolute path in any supporting scripts etc. It’s important in this context because when you use Concourse you’re forced to specify cache directories relative to the current directory which Concourse drops you in; if you refer to an absolute path like /home/somebody/.cache then it needs to be baked into the image which Concourse launches.

@Gabriel439 I’m not sure how sustainable that is? We might be using Fedora but I’m sure other people are using Ubuntu and Arch and a huge number of other distributions; furthermore, the kinds of projects which we’d like to have access to are projects that we own like dhall-kops, dhall-prometheus-operator, dhall-aws that don’t necessarily match the fit/finish expectations of projects like dhall-packages (we feel more at liberty to take a cowboy approach to updates and documentation when we’re the only people using our open-sourced software so far). Should we be expected to maintain public RPMs and DEBs and AUR packages etc. for our own software? How do we keep the same quick update cycle that we’re used to for what is essentially (particularly in the case of dhall-aws) unfinished (definitely at least unstable) software, if we need to work with public packagers/maintainers? If an RPM (and I know about this because I used to write and maintain rpmspecs years ago, in a different job) essentially boils down to scripts - why not just run these scripts directly in our build container, for each package that we need to pre-cache?

The naive solution is to have RUN dhall resolve --file /path/to/imports.dhall be part of the Dockerfile, but again, the issue I have with that is that Concourse will either end up pointing to /tmp/build/<hash>/xdg-cache/dhall, which will be empty, or Concourse will default to using /home/worker/.cache/dhall, where the cached expressions evaluated during the build will be wiped out when the container is erased at the end of the build. And I’m not sure how packaging Dhall for Fedora solves that issue.

tristanC · March 7, 2020, 2:44pm

@ari-becker thanks for the Concourse explanation.

It seems like the culprit is how dhall looks for cached data in a single location based on the XDG home. Perhaps dhall could fall back to a default site location such as /usr/lib/dhall/*/ ? Then a packager could drop the binary form of libraries in that location. For example, a dhall-kubernetes package would provide a /usr/lib/dhall/kubernetes directory with:
- package.dhall file with a https://package-original-url package-digest
- cache directory with the binary form of the digest
- README, LICENSE, …

This directory could be maintained by the system package manager, and/or dhall could also provide an install subcommand.
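To make the fallback idea concrete, here is a minimal Python sketch of a lookup that tries several cache directories in order. It is not how any Dhall implementation works today: the function names and the idea of a site-wide directory are assumptions taken from the proposal above. The only existing convention it relies on is that cache entries are files named 1220 followed by the hex SHA-256 digest.

```python
from pathlib import Path
from typing import Iterable, Optional


def cache_file_name(sha256_hex: str) -> str:
    # Cache entries are named after the multihash of the expression:
    # "1220" (sha2-256, 32 bytes) followed by the hex digest.
    return "1220" + sha256_hex


def find_cached(sha256_hex: str, cache_dirs: Iterable[Path]) -> Optional[Path]:
    # Hypothetical lookup: try each directory in order (for example the
    # per-user XDG cache first, then a site-wide directory that a package
    # manager populates) and return the first matching entry.
    name = cache_file_name(sha256_hex)
    for directory in cache_dirs:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

With something like this, a task-local cache such as the Concourse one could stay first in the list, and a read-only directory baked into the image would only be consulted on a miss.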

ari-becker · March 7, 2020, 5:01pm

@tristanC I like where you’re going with that idea, but the specific /usr/lib/dhall/<x> directory shouldn’t be a standard, as it’s distribution-specific (which is the issue with using a conditional import to point to a specific filepath: it only works as long as you don’t try to use it in a different distribution).

Maybe a good way of dealing with the issue is to take the current single cache directory and separate it into two cache directories - one used for semi-semantic caching (i.e. #1154, matching the Concourse cache of /tmp/build/<hash>/dhall-semi-semantic-cache above) and one that can be used for “installations”?

Gabriel439 · March 8, 2020, 12:08am

@ari-becker: Just to clarify: I did not mean to suggest that Dhall projects would need to be written to be amenable to package managers. For example, the Nixpkgs support for Dhall that I added works for any Dhall package where remote imports are frozen, without any changes to the package. That was how I was able to package dhall-packages for Nix without upstreaming any changes to it:

github.com

NixOS/nixpkgs/blob/master/pkgs/development/dhall-modules/dhall-packages.nix#L32-L65


"0.11.1" =
  let
    k8s_6a47bd = dhall-kubernetes."3.0.0".override {
      rev    = "6a47bd50c4d3984a13570ea62382a3ad4a9919a4";
      sha256 = "1azqs0x2kia3xw93rfk2mdi8izd7gy9aq6qzbip32gin7dncmfhh";
    };


    k8s_4ad581 = dhall-kubernetes."3.0.0".override {
      rev    = "4ad58156b7fdbbb6da0543d8b314df899feca077";
      sha256 = "12fm70qbhcainxia388svsay2cfg9iksc6mss0nvhgxhpypgp8r0";
    };


    k8s_fee24c = dhall-kubernetes."3.0.0".override {
      rev    = "fee24c0993ba0b20190e2fdb94e386b7fb67252d";
      sha256 = "11d93z8y0jzrb8dl43gqha9z96nxxqkl7cbxpz8hw8ky9x6ggayk";
    };


  in
    { rev    = "8d228f578fbc7bb16c04a7c9ac8c6c7d2e13d1f7";
      sha256 = "1v4y1x13lxy6cxf8xqc6sb0mc4mrd4frkxih95v9q2wxw4vkw2h7";

(This file has been truncated.)

The general architectural idiom I’m trying to preserve is that the only tool necessary to author a package is a text editor. In particular, I’m trying to avoid a multi-step publication process where users have to first author the code as Text, then do a separate post-processing step to convert it to CBOR (or any other post-processing step). The binary representation is intended to be a transparent optimization handled by the runtime, rather than by the user.

For example, I would be fine with extending the standard so that an interpreter could specify Accept: application/dhall.cbor or something similar when importing an expression and then the server could optionally serve the CBOR-encoded version, but that again is an implementation detail of the runtime, not something that the user should be aware of.

philandstuff · March 8, 2020, 6:46am

I like the idea of not requiring as CBOR to import binary dhall, and for remote imports, content negotiation seems like the obvious implementation choice. (Side note: the MIME type should be application/dhall+cbor. If we agree to pursue this, we should probably register MIME types with IANA.)
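As a rough illustration of what content negotiation could look like on the client side, here is a small Python sketch. The media type application/dhall+cbor is only the one proposed in this thread and is not registered, and both function names are made up for the example; the request is built but never sent.

```python
from urllib.request import Request

# Media type proposed in this thread; an assumption, not a registered type.
DHALL_CBOR = "application/dhall+cbor"


def build_import_request(url: str) -> Request:
    # Prefer the binary form, but still accept anything else (such as
    # plain Dhall text) so that servers unaware of the type keep working.
    return Request(url, headers={"Accept": f"{DHALL_CBOR}, */*;q=0.5"})


def response_is_cbor(content_type: str) -> bool:
    # Compare only the media type, ignoring parameters such as charset.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == DHALL_CBOR
```

The interpreter would then decode the body as CBOR only when the server explicitly says so, and parse it as text otherwise, which keeps the binary form an implementation detail of the runtime.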

However we need a way to support this with local imports, so we need a way of determining if a local file is dhall text or CBOR. I can think of a few options here:

- File extension based: .dhallb files are parsed as CBOR, all others as text
- Sniffing: read a few bytes from the start and try to guess if it’s text or CBOR (for example, invalid UTF-8 sequences would indicate CBOR)
- Self-describing CBOR: require CBOR files to start with the magic self-describing tag 55799 that we already support, and detect that specific byte sequence

This last option is my strong preference. Sniffing is error-prone and has introduced security bugs in other software. File extensions are inflexible and still basically require the source code to know whether it is importing CBOR or text, which is something I’d like to avoid.
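For reference, the self-describing tag 55799 encodes to the three bytes 0xD9 0xD9 0xF7, so the check is a plain prefix comparison. A minimal Python sketch (the function name is hypothetical):

```python
# CBOR tag 55799 ("self-described CBOR") serialises as these three bytes.
SELF_DESCRIBE_CBOR = bytes([0xD9, 0xD9, 0xF7])


def is_self_described_cbor(data: bytes) -> bool:
    # A file that starts with the magic prefix is treated as CBOR;
    # everything else is handed to the text parser.
    return data.startswith(SELF_DESCRIBE_CBOR)
```

The prefix cannot be confused with Dhall source, because 0xD9 followed by 0xD9 is not a valid UTF-8 sequence, so no text file can begin with it.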

(That said, we could finesse the file extension option by having the import resolution process tack .dhallb on to the end of the requested file: if it exists, then parse as CBOR, if not, parse the original file as text. So an import of ./foo would pull in ./foo.dhallb if it exists. I still prefer the self describing CBOR option.)
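The extension fallback in the parenthetical could be sketched like this in Python. The function name and the returned "cbor"/"text" labels are illustrative assumptions, not part of any implementation:

```python
from pathlib import Path
from typing import Tuple


def resolve_local_import(requested: Path) -> Tuple[Path, str]:
    # Tack ".dhallb" onto the requested file name; if that sibling file
    # exists, use it and parse it as CBOR. Otherwise fall back to the
    # requested file and parse it as text.
    binary = requested.with_name(requested.name + ".dhallb")
    if binary.is_file():
        return binary, "cbor"
    return requested, "text"
```

So an import of ./foo would resolve to ./foo.dhallb when that file is present, without the importing code having to mention the binary form.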

Finally, if we’re going to do this for local imports, we might as well do the same for HTTP and ignore MIME types.

Gabriel439 · March 9, 2020, 11:08pm

@philandstuff: Self-describing CBOR would also be my preference, although the use case for CBOR-encoded local imports seems less compelling than for remote imports because as far as I can tell the only benefit is conserving disk space. For example, an implementation could preserve most of the decoding speed gains by textually hashing files and remembering their CBOR representation in a content-addressable store (where the address is the hash of the raw text).
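A rough Python sketch of the content-addressable store described above, assuming SHA-256 as the text hash and a flat directory as the store. The encode argument is a hypothetical stand-in for whatever parses the text and produces its CBOR encoding:

```python
import hashlib
from pathlib import Path
from typing import Callable


def load_encoded(source: Path, store: Path, encode: Callable[[str], bytes]) -> bytes:
    # The store is keyed by the hash of the raw text, so an unchanged
    # file is parsed and encoded at most once.
    text = source.read_text(encoding="utf-8")
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    entry = store / key
    if entry.is_file():
        # Cache hit: reuse the stored binary form and skip the text parser.
        return entry.read_bytes()
    # Cache miss: encode once and remember the result for next time.
    encoded = encode(text)
    store.mkdir(parents=True, exist_ok=True)
    entry.write_bytes(encoded)
    return encoded
```

Editing the file changes its hash, so a stale entry is simply never looked up again; the user still only ever touches the text.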