RFC: proxy.dhall-lang.org

Gabriel439 · February 2, 2020, 4:42am

I’m thinking of building a proxy.dhall-lang.org server which can be used as a caching forward proxy that all Dhall implementations can benefit from.

The initial motivation for this is the dhall-kubernetes project, which takes a while to import from a cold cache if you are importing the project remotely. Using a forward proxy can reduce the number of HTTP requests from hundreds of requests to just one request.

Here is the idea:

If proxy.dhall-lang.org is reachable the interpreter can optionally use it as a forward proxy

For the Haskell implementation I’m thinking of making this the default with an option to disable
The interpreter can also include an HTTP header containing an integrity check for the import if one is present
The proxy then caches imports protected by integrity checks

… with a size limit per cached import and also a global size limit

The cache key would be the multihash for the integrity check (e.g. 1220…)
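To make the two size limits concrete, here is a minimal in-memory sketch of a cache keyed by multihash. The class name `BoundedCache`, the byte limits, and the least-recently-used eviction policy are all assumptions for illustration; the proposal above only says that there would be a per-import limit and a global limit.

```python
from collections import OrderedDict
from typing import Optional


class BoundedCache:
    """Cache keyed by multihash with a per-entry and a global size limit.

    Eviction is least-recently-used, which is an assumption of this sketch.
    """

    def __init__(self, max_entry_bytes: int, max_total_bytes: int) -> None:
        self.max_entry_bytes = max_entry_bytes
        self.max_total_bytes = max_total_bytes
        self.total_bytes = 0
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, key: str, value: bytes) -> bool:
        """Store value under key; return False if it exceeds a size limit."""
        if len(value) > self.max_entry_bytes or len(value) > self.max_total_bytes:
            return False
        # Replacing an existing entry must not double-count its size.
        if key in self.entries:
            self.total_bytes -= len(self.entries.pop(key))
        # Evict the least recently used entries until the new one fits.
        while self.total_bytes + len(value) > self.max_total_bytes:
            _, evicted = self.entries.popitem(last=False)
            self.total_bytes -= len(evicted)
        self.entries[key] = value
        self.total_bytes += len(value)
        return True

    def get(self, key: str) -> Optional[bytes]:
        """Return the cached bytes, marking the entry as recently used."""
        value = self.entries.get(key)
        if value is not None:
            self.entries.move_to_end(key)
        return value
```

A real proxy would presumably persist entries to disk rather than hold them in memory; the bookkeeping for the two limits would look much the same.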

philandstuff · February 2, 2020, 10:51am

+1 to this idea.

It sounds very similar to Go’s proxy (docs and useful blog post) which I’ve had nothing but positive experiences with.

Gabriel439 · April 25, 2020, 4:40pm

So the good news is that this is pretty simple to implement. The basic sketch is:

A Dhall interpreter can optionally specify one or more URIs for remote caches
If an expression is protected by an integrity check of the form sha256:${HASH}, then the interpreter checks ${URI}/1220${HASH} for each such URI
file:// schemes are supported for remote caches just like how Go module proxy support works (thanks for the tip, @philandstuff)
We can then host a public store.dhall-lang.org (since cache.dhall-lang.org is already taken) that serves a pre-populated .cache/dhall directory
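As a rough illustration of the lookup described above, the following sketch turns an integrity check into the list of URLs an interpreter would try. The function names `cache_key` and `candidate_urls` are hypothetical; the only parts taken from the proposal are the `sha256:${HASH}` form and the `${URI}/1220${HASH}` layout. The `1220` prefix is the multihash header for a 32-byte SHA2-256 digest.

```python
import re
from typing import List

# Multihash header: 0x12 identifies sha2-256, 0x20 is the digest length (32 bytes).
SHA256_MULTIHASH_PREFIX = "1220"


def cache_key(integrity_check: str) -> str:
    """Convert 'sha256:<64 hex digits>' into the multihash used as the cache key."""
    match = re.fullmatch(r"sha256:([0-9a-f]{64})", integrity_check)
    if match is None:
        raise ValueError(f"not a sha256 integrity check: {integrity_check!r}")
    return SHA256_MULTIHASH_PREFIX + match.group(1)


def candidate_urls(integrity_check: str, cache_uris: List[str]) -> List[str]:
    """Build ${URI}/1220${HASH} for every configured cache, in order."""
    key = cache_key(integrity_check)
    return [uri.rstrip("/") + "/" + key for uri in cache_uris]
```

Because the key is appended to the base URI as plain text, an `https://` cache and a `file://` directory are handled identically at this stage; only the fetching step differs.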

The main decision left is how to populate the cache for store.dhall-lang.org while (A) not being vulnerable to cache poisoning and (B) respecting client privacy. The approach that I’m leaning towards is having the contents of the cache be specified under version control in the dhall-lang.org repository.

In particular, now that we have Nixpkgs support for Dhall, we can specify the desired packages to cache as a Nix derivation that produces a populated cache directory. Then we can pre-cache things like each version of the Prelude and commonly-used repositories like dhall-kubernetes or dhall-bhat, and people can add new cache products by submitting pull requests against the dhall-lang repository.

philandstuff · May 4, 2020, 12:38pm

I’ve been thinking about this and, although this solution is good enough for now, I don’t find it satisfactory or scalable to populate the cache via PR. I think the cache should be automatically populated with anything available on the public internet, just like a Go module proxy is. Each time a remote URL with a hash and without a using directive is encountered, the proxy is tried first, and the proxy tries to fetch the target URL over the public internet.

I’d like to unpack what is meant by “client privacy” here. What are the concerns? I can think of two, quite distinct, concerns:

the problem that confidential data may get into a public proxy
the problem that web requests to a public proxy may reveal, to the proxy’s operators, the existence of Dhall code at private URLs

Have I missed any? In any case, I’ll deal with each here:

Confidential data in a public proxy

If the mechanism to get data into a proxy is that the proxy itself requests it over the public internet, with no mechanism for supplying Authorization or other headers, then the only data that can get into a public proxy can be data that has been published on the public internet. In the first instance, if you don’t want data in a proxy, you “just” don’t publish it publicly.

Of course, accidents can happen, and things are published unintentionally. Then the proxy risks magnifying the mistake: when the original confidential data is taken down, the proxy continues making it available.

The best-case scenario here is that the confidential data is some sort of secret token (a password, private key, etc) which can be quickly and easily rotated (to make the leaked token worthless). But, of course, this is not always easy or possible with all secrets; and there are other kinds of confidential data which aren’t of this category.

We may wish to allow cache purging to happen in such scenarios. However I don’t think this feature is needed for day one of proxy operation.

Leaking the existence of private Dhall resources

The scenario here is: a Dhall implementation sees an import for https://private.widgets-ltd.example.com/dhall/code - a private resource, not reachable from the public internet - and yet the Dhall runtime makes a request to the public proxy to try to fetch the private resource. The proxied request fails, because the private resource is not reachable, but the proxy can still make some inferences about the resources, such as the existence of different bits of code, and their URLs. The URLs may also contain interesting information about the code content.

I don’t know how seriously to take this, because I don’t know how much damage can really be made here. Import URLs in Dhall are static strings - there’s no way to, for example, build a complex query string which might contain sensitive data.

Nevertheless, I also don’t know that this isn’t a problem. If we wanted, we could emulate Go’s GOPRIVATE environment variable, which provides a way to prevent certain modules from being fetched via the proxy. However, I imagine that, in practice, awareness and use of this feature would be relatively low.
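For illustration, a GOPRIVATE-style opt-out could look something like the sketch below. Nothing here is an existing Dhall feature: the function name `is_private` and the idea of passing in a comma-separated pattern list (as one might read from a hypothetical `DHALL_PRIVATE` variable) are assumptions. Go matches glob patterns against module path prefixes; this sketch simplifies that to matching against the import's hostname.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_private(import_url: str, private_patterns: str) -> bool:
    """Return True if the import's host matches any comma-separated glob pattern.

    Imports for which this returns True would be fetched directly and never
    sent to the public proxy.
    """
    # urlsplit lowercases the hostname, so patterns are lowercased to match.
    host = urlsplit(import_url).hostname or ""
    patterns = [p.strip().lower() for p in private_patterns.split(",") if p.strip()]
    return any(fnmatchcase(host, pattern) for pattern in patterns)
```

With the pattern `*.widgets-ltd.example.com`, the import from the scenario above would bypass the proxy, while imports from other hosts would still go through it.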

Gabriel439 · May 4, 2020, 3:51pm

@philandstuff: One thing to keep in mind is that at some point we may provide shared infrastructure for hosting packages (i.e. a packages.dhall-lang.org), for a few reasons:

To enable CORS for those packages (just like we do for the Prelude)
To provide convenient short-hand URLs that are easier to remember (e.g. instead of raw.githubusercontent.com/…)
To host generated documentation (analogous to Hackage)
To improve package discoverability (e.g. package search or Hoogle-like functionality)

… and if we had such a package registry then we could automatically cache such packages at store.dhall-lang.org without having to create pull requests.

ari-becker · May 6, 2020, 2:48pm

You’d be surprised what organizations consider to be sensitive. Even hostnames / DNS records can be considered sensitive, as they provide directions to potential attackers to potential weak points in the network that might be easier to compromise and gain a foothold for further attacks.

Strictly speaking, I do think that systems which function without uncontrolled outbound connections to the Internet are more mature and resilient than systems which need to connect to the Internet to function. Organizations which take care to use e.g. private Prelude hosting, and which suddenly see attempted outbound Internet connections where none existed before, are probably not going to react very kindly.

philandstuff · May 7, 2020, 8:40am

I work in government, so I’m well aware of this sort of thing. In my main workplace, I operate systems with egress proxies to prevent unbounded egress traffic.

However, my experience is that some security controls are more a folk memory or a CYA exercise than an appropriate control for a clearly-articulated threat. Some sites run egress proxies as a defence-in-depth measure because they’re trying to mitigate exfiltration traffic in the event of a compromise; other sites run egress proxies because it’s the approved “enterprise architecture” without the ability or autonomy to question this decision.

I would much rather Dhall’s design responds to clearly-articulated threats, rather than merely designing to a claimed “secure architecture” without examining what the security goals are. So if we can collect appropriate threats to add to a threat model for Dhall, we can use that to inform the design of the proxy.

Thinking back to my own workplace, we generally take a skeptical view on the idea that (our own) IP addresses or DNS records are secret information. That’s not to say they never are, but rather a lot of the claims we hear that IPs/DNS names are secret don’t hold up to detailed scrutiny. Often there are better security controls that allow you to not treat these as secret, and you’d rather not make them secret because it’s harder to rotate an IP or DNS name in the event of compromise than something like a password or encryption key.

But I’d also think that an environment that wanted to restrict egress traffic would probably want the ability to run their own Dhall proxy as well. So we might want the equivalent of a GOPROXY environment variable to override the default public proxy or to switch off the proxy entirely.
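A sketch of what such an override might look like is below. The variable name `DHALL_PROXY`, the `off` keyword, and the choice of default are assumptions modelled loosely on GOPROXY, not an agreed design.

```python
from typing import List, Mapping

# Assumed default: the public store discussed earlier in this thread.
DEFAULT_PROXIES = ["https://store.dhall-lang.org"]


def resolve_proxies(environ: Mapping[str, str]) -> List[str]:
    """Decide which caches to consult, based on a hypothetical DHALL_PROXY variable.

    - unset: use the default public proxy
    - "off": use no proxy at all
    - otherwise: a comma-separated list of cache URIs, tried in order
    """
    raw = environ.get("DHALL_PROXY")
    if raw is None:
        return list(DEFAULT_PROXIES)
    if raw.strip().lower() == "off":
        return []
    return [uri.strip() for uri in raw.split(",") if uri.strip()]
```

An interpreter would call this with `os.environ`. A site that restricts egress traffic could then set the variable to its own internal cache, or to `off`, so that the interpreter never contacts the public proxy.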

ari-becker · May 7, 2020, 1:30pm

@philandstuff You’re right, of course. But I find that it’s a left-brain/right-brain conflict. Most organizations that are hyper-sensitive about security can’t point to well-assembled reports detailing exactly who is attacking them and how many resources they have at their disposal. Too often, the people in security roles in big organizations are, pardon my pejorative that I use in an attempt to be illustrative, tinfoil-hatters.

Unfortunately, when you’re trying to sell a new project to people whose decisions are ruled by emotions rather than logic, you need to ensure that the project connects with the audience emotionally. Logical appeals fall on deaf ears. Decision makers who make their decisions from a place of emotion will run if something frightens them, rather than being productively challenged to improve their security posture.

The deeper question is, considering that Dhall (thankfully) values the technically / rationally correct solution to the extreme, whether making the emotional concession impinges upon the core technical values of Dhall and should therefore be rejected. As this is a marginal question to begin with, I’m not sure, and I’m not the person to answer that question, so long as it is answered by the maintainers with full understanding of the underlying concerns.

Gabriel439 · May 7, 2020, 3:06pm

@ari-becker @philandstuff: I want to clarify that I don’t necessarily believe that including the destination in the URL as the Go module proxy does is the rationally correct solution.

One of the things we strive to do is to be as idiomatic to the web as possible, and the Go module proxy idiom seems to go against other web trends that I see. Specifically, most forward proxies that support HTTPS will have the client connect using the CONNECT method, and then let everything else be tunneled inside of that.

Including the destination in the URL as the Go module proxy does seems kind of like a work-around to me that is trying to simulate a forward proxy using a reverse proxy and it doesn’t feel like a web-native solution.

philandstuff · May 7, 2020, 5:01pm

This pattern makes sense where the forward proxy is an egress proxy that applies an allow-list of IP addresses or DNS names. However, because all content after the CONNECT method is end-to-end encrypted, the proxy cannot do caching.

I don’t think there is a “web-native” solution for a caching forward proxy. This is where the traditional web principle of a layered architecture (one of the REST constraints) slams into the modern web principle of end-to-end encryption.

If we want a caching forward proxy, I don’t see much way to achieve it other than by simulating it as a reverse proxy, similar to the Go approach.
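To illustrate what "simulating it as a reverse proxy" means in practice, here is a small sketch in which the client asks the proxy for an ordinary HTTPS resource whose path carries the destination. The proxy base URL, both function names, and the percent-encoding scheme are hypothetical; Go's proxy uses its own path layout based on module paths rather than this encoding.

```python
from urllib.parse import quote, unquote


def proxied_url(proxy_base: str, target_url: str) -> str:
    """Client side: embed the destination URL in the proxy request path."""
    # safe="" also percent-encodes '/' and ':', so the target is a single path segment.
    return proxy_base.rstrip("/") + "/" + quote(target_url, safe="")


def target_from_path(request_path: str) -> str:
    """Proxy side: recover the destination URL from the request path."""
    return unquote(request_path.lstrip("/"))
```

Because the TLS connection terminates at the proxy, the proxy can see which resource is being requested and serve it from its cache, which is exactly what a CONNECT tunnel prevents.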

Gabriel439 · May 22, 2020, 3:56pm

I wanted to give an update on what has been done so far and what still remains.

I created a store.dhall-lang.org which serves a limited binary cache:

github.com/dhall-lang/dhall-lang: Create `store.dhall-lang.org`

dhall-lang:master ← dhall-lang:gabriel/store.dhall-lang.org · opened 05:02AM - 03 May 20 UTC by Gabriel439 · +61 -25

For more details, see: https://discourse.dhall-lang.org/t/rfc-proxy-dhall-lan…g-org/144/3?u=gabriel439

The short summary is that we can declaratively add Dhall packages to cache and make that cache available from `store.dhall-lang.org`. The path of a cached item is the item's multihash. So, for example, the address of version 13.0.0 of the Prelude would be:

https://store.dhall-lang.org/12204aa8581954f7734d09b7b21fddbf5d8df901a44b54b4ef26ea71db92de0b1a12

Interpreters can then optionally avail themselves of this shared cache, either automatically or with explicit user consent, in order to speed up package imports.

Most of the changes are due to the fact that this required upgrading Nixpkgs in order to pick up the Nixpkgs support for building Dhall packages.
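Since the path of a cached item is its multihash, a client does not have to trust the store blindly: it can re-hash whatever it downloads. The sketch below assumes that the bytes served at a cache path are exactly the bytes covered by the SHA-256 integrity check; the function name `verify_cache_entry` is hypothetical.

```python
import hashlib


def verify_cache_entry(multihash_key: str, content: bytes) -> bool:
    """Check that downloaded bytes hash to the digest named in the cache path.

    multihash_key is the final path segment, e.g. '1220' followed by
    64 hex digits. Anything else is rejected.
    """
    if not multihash_key.startswith("1220") or len(multihash_key) != 68:
        return False
    expected_digest = multihash_key[4:]
    return hashlib.sha256(content).hexdigest() == expected_digest
```

An interpreter that rejects any response failing this check would treat a poisoned or corrupted cache entry as a cache miss and fall back to fetching the import from its original location.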

The next step is that I’m creating a dhall-to-nixpkgs utility (i.e. the Dhall analog of cabal2nix) so that we can easily add Dhall repositories to the store. That is pretty close to completion (all that’s missing is documentation), and you can find my work-in-progress branch here:

Once that is done I’ll put up a pull request extending store.dhall-lang.org to cache all recent versions of the Prelude, dhall-kubernetes, and dhall-packages and also add instructions for others to contribute their own packages to the cache via pull request.

After that I’ll advertise that the cache is ready and invite all implementations to optionally use it (and update the Haskell one to do so, too)

Gabriel439 · January 14, 2021, 5:15pm

Alright, store.dhall-lang.org is live now and I described things in more detail here: Store.dhall-lang.org is available