Opened on Aug 26, 2016
This proposal was originally from #24711 (comment).
With several changes to the image format, the scope of build caching has been limited to local nodes. This is problematic for architectures that dispatch builds to arbitrary nodes, since pulling new images and data will not populate the build cache.
The main concern here is cache poisoning. The worst part about it is that it is not at all obvious whether you are affected or protected. It can only be mitigated by limiting the horizon of data that one trusts.
Anything that circumvents that protection, even `docker save`/`docker load`, is going to open your infrastructure up to injection of malicious content. The proposal in #20316 and previous proposals have not addressed this problem. While we all want fast builds (and I really do), introducing cache poisoning to the build step of the infrastructure must be avoided. Could you imagine the impact if someone could just inject a malicious layer into `library/ubuntu` or `library/alpine`?
The other aspect to this is the misapplied assumption about the idempotence of shell commands. `apt-get update` run twice is never guaranteed to have the same result. Ever. That is just not how it works. If you have a build cache that is never purged, you will never update your upstream software. That may or may not be the intent. Even worse, if this build cache gets filled in with remote data, you probably have no visibility into when that command was run.
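To make that concrete, here is a hypothetical Dockerfile fragment: if the `RUN` line is satisfied from a cache that is never purged, the package index (and anything installed from it) is frozen at whatever time the cached layer was originally built, not at build time.

```
FROM debian
# If this line hits the build cache, these package lists are as old as
# the cached layer, not as old as this build.
RUN apt-get update && apt-get install -y curl
```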
The underlying problem here is that with the 1.10 changes to the image format, we no longer restore the parent image chain when pulling from a registry. As such, a proper solution to this problem involves something that can control the level of trust extended to content used in a distributed build cache.
Let's look at how we build an image, with `FROM alpine` at the top:

```
docker pull mysuperapp:v0
docker build -t mysuperapp:v1 .
```
In this simple case, we cannot assume that a remote `mysuperapp:v0` and the ongoing build are related, since that assumption would introduce the cache poisoning scenario that we need to avoid. However, one may have local registry infrastructure that they know they can trust. While we can infer parentage (despite other assertions, this is still possible), we may not be able to trust that parentage from a build-caching perspective for all registries. But this build environment is special.
What better way than to tell the build process that it can trust a related image?
```
docker build --cache-from mysuperapp:v0 -t mysuperapp:v1 .
```
The above would allow `Dockerfile` commands to be satisfied from the entries of `mysuperapp:v0` in the build of `mysuperapp:v1`. Job done!
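To sketch how this would fit into a build pipeline (hypothetical, since the flag is only being proposed here; image names are from the running example), a CI job on a fresh node might do:

```
# Pull the previous release to seed the cache (tolerate a missing
# image on the very first build).
docker pull mysuperapp:v0 || true

# Build, explicitly trusting the pulled image as a cache source.
docker build --cache-from mysuperapp:v0 -t mysuperapp:v1 .

# Publish the result so the next node can use it as a cache source.
docker push mysuperapp:v1
```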
No! We still have a problem. Now, my build system has to know tag lineage (`mysuperapp:v0` < `mysuperapp:v1`). Let's modify the meaning of the tagless reference to mean something slightly different:
```
docker pull -a mysuperapp
docker build --cache-from mysuperapp -t mysuperapp:v1 .
```
In the above, we pull all the tags from `mysuperapp`, any layer of which can satisfy the build cache. In practice, this is probably a little wide for most scenarios, so we can allow multiple `--cache-from` directives on the command line:
```
docker build --cache-from mysuperapp:v0 --cache-from mysuperapp:v1 -t mysuperapp:v2 .
```
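If the set of tags is not known ahead of time, a build script could enumerate whatever is available locally and pass each tag through. This is a hypothetical sketch; the repository name and the `v2` target are assumptions carried over from the running example:

```
# Trust every locally available tag of mysuperapp as a cache source
# for the next build.
cache_args=""
for tag in $(docker image ls --format '{{.Repository}}:{{.Tag}}' mysuperapp); do
  cache_args="$cache_args --cache-from $tag"
done
docker build $cache_args -t mysuperapp:v2 .
```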
There are many possibilities here to make this more flexible, such as running a registry service specially for the purpose of build caching, e.g. `mybuildcache.internal/mysuperapp`. Did you know that you can just run a registry and rsync the filesystem around without locking? You can also rsync from multiple sources and merge the result safely (kind of). Such a registry can be purged periodically (or someone could submit a PR to purge old data).
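A minimal sketch of that rsync idea, assuming the registry's default filesystem storage path (`/var/lib/registry`) and hypothetical host names: registry blobs are content-addressed, so merging stores from multiple sources is additive, while the tag data is the mutable piece (hence the "kind of" above).

```
# Merge registry storage from two build hosts into a local cache
# registry. Content-addressed blobs from multiple sources do not
# conflict with each other.
rsync -a buildhost-a:/var/lib/registry/ /var/lib/registry/
rsync -a buildhost-b:/var/lib/registry/ /var/lib/registry/
```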
We can take this even further, but I hope the point is brought home. This is probably less convenient than the original behavior, but it is a good trade-off. It leverages the existing infrastructure and has the possibility of being extended as use cases change.
Closes #18924.