I’ve been thinking about this and, although this solution is good enough for now, I don’t find populating the cache via PR satisfactory or scalable. I think the cache should be populated automatically with anything available on the public internet, just as a Go module proxy is. Each time a remote URL with a hash and without a using directive is encountered, the proxy is tried first, and the proxy in turn tries to fetch the target URL over the public internet.
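To make the proposed rule concrete, here is a rough sketch of how a client might decide where to look for an import. Everything named here is an assumption for illustration only: the proxy address, the shape of the proxy URL (hash in the path, target URL as a query parameter), and the fallback to the origin when the proxy fails are not part of any existing Dhall implementation or standard.

```python
from typing import List, Optional
from urllib.parse import quote

# Hypothetical proxy address; no such service is implied to exist.
PROXY_BASE = "https://dhall-proxy.example.invalid"


def fetch_order(url: str, sha256: Optional[str], has_using: bool) -> List[str]:
    """Return the locations to try, in order, for one remote import."""
    if sha256 is not None and not has_using:
        # Hash present and no `using` directive: try the proxy first.
        # The target URL is passed along so the proxy can fetch it itself
        # on a cache miss (assumed URL layout).
        proxied = f"{PROXY_BASE}/sha256/{sha256}?url={quote(url, safe='')}"
        # Falling back to the origin afterwards is an assumption.
        return [proxied, url]
    # No hash, or custom headers requested: never involve the proxy.
    return [url]
```

With this rule, an import with custom headers, or one without an integrity check, would never touch the proxy at all.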
I’d like to unpack what is meant by “client privacy” here. What are the concerns? I can think of two, quite distinct, concerns:
- the problem that confidential data may get into a public proxy
- the problem that web requests to a public proxy may reveal, to the proxy’s operators, the existence of Dhall code at private URLs
Have I missed any? In any case, I’ll deal with each here:
Confidential data in a public proxy
If the mechanism to get data into a proxy is that the proxy itself requests it over the public internet, with no mechanism for supplying Authorization or other headers, then the only data that can get into a public proxy is data that has been published on the public internet. In the first instance, if you don’t want data in a proxy, you “just” don’t publish it publicly.
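As a small illustration of "no mechanism for supplying headers", the proxy's upstream request could be built so that nothing from the client is ever forwarded. This is only a sketch using Python's standard library; the function name and the `client_headers` parameter are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def upstream_request(target_url: str, client_headers: dict) -> Request:
    """Build the request the proxy would send to the origin."""
    scheme = urlsplit(target_url).scheme
    if scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {scheme!r}")
    # client_headers is deliberately ignored: no Authorization, Cookie or
    # other client-supplied header ever reaches the origin, so the proxy
    # can only see what an anonymous visitor could see.
    return Request(target_url, method="GET")
```

The request object is only constructed here, not sent; actually fetching and caching the response is out of scope for the sketch.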
Of course, accidents can happen, and things are published unintentionally. Then the proxy risks magnifying the mistake: when the original confidential data is taken down, the proxy continues making it available.
The best-case scenario here is that the confidential data is some sort of secret token (a password, private key, etc.) which can be quickly and easily rotated (to make the leaked token worthless). But, of course, this is not always easy or possible with all secrets; and there are other kinds of confidential data which don’t fall into this category.
We may wish to allow cache purging to happen in such scenarios. However, I don’t think this feature is needed for day one of proxy operation.
Leaking the existence of private Dhall resources
The scenario here is: a Dhall implementation sees an import for https://private.widgets-ltd.example.com/dhall/code - a private resource, not reachable from the public internet - and yet the Dhall runtime makes a request to the public proxy to try to fetch the private resource. The proxied request fails, because the private resource is not reachable, but the proxy can still make some inferences about the resources, such as the existence of different bits of code, and their URLs. The URLs may also contain interesting information about the code content.
I don’t know how seriously to take this, because I don’t know how much damage can really be done here. Import URLs in Dhall are static strings - there’s no way to, for example, build a complex query string which might contain sensitive data.
Nevertheless, I also don’t know that this isn’t a problem. If we wanted, we could emulate Go’s GOPRIVATE environment variable, which provides a way to prevent certain modules from being fetched via the proxy. However, I imagine that, in practice, awareness and use of this feature would be relatively low.
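For what it's worth, a GOPRIVATE-style opt-out could be quite small. The sketch below assumes a hypothetical `DHALL_PRIVATE` environment variable holding a comma-separated list of glob patterns. It is also a simplification of what Go does: Go matches patterns against module path prefixes, whereas this only looks at the hostname of the import URL.

```python
import os
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_private(url: str, patterns: str) -> bool:
    """True if the URL's host matches any comma-separated glob pattern."""
    host = (urlsplit(url).hostname or "").lower()
    globs = [p.strip().lower() for p in patterns.split(",") if p.strip()]
    return any(fnmatchcase(host, g) for g in globs)


def may_use_proxy(url: str) -> bool:
    """Skip the public proxy for hosts listed in DHALL_PRIVATE (hypothetical)."""
    return not is_private(url, os.environ.get("DHALL_PRIVATE", ""))
```

For example, setting `DHALL_PRIVATE=*.widgets-ltd.example.com` would keep the import from the scenario above away from the proxy, while imports from other hosts would still go through it.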