
I'm working out a way to do Docker layer caching in CircleCI, and I have a working solution, but I am trying to improve it. The problem in any form of CI is that the image history is wiped for every build, so one needs to work out which files to save and restore, using the CI system's caching directives, and then what to load back into Docker.

First I tried this, inspired by this approach on Travis. To restore:

if [ -f /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz ]; then gunzip -c /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz | docker load; docker images; fi

And to create:

docker save $(docker history -q ${CIRCLE_PROJECT_REPONAME}:latest | grep -v '<missing>') | gzip > /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz

This seemed to work OK, but my Dockerfile uses a two-stage build, and as soon as I COPYed files from the first stage into the final stage, it stopped using the cache. I assume this is because (a) docker history only applies to the final stage, and (b) the non-cached changes in the first build stage have a new mtime, so when they are copied to the final stage they are regarded as new.
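For illustration, here is a minimal two-stage Dockerfile of the shape I am describing, written out as a shell heredoc (the base image, file names and commands are invented for the example, not my real ones):

cat > Dockerfile <<'EOF'
# First stage: do the expensive work
FROM alpine:3.7 AS build
RUN echo "expensive build output" > /artifact

# Final stage: copy the result out of the first stage
FROM alpine:3.7
COPY --from=build /artifact /artifact
EOF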

To get around this problem, I decided to try saving all images to the cache:

docker save $(docker images -a -q) | gzip > /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz

This worked! However, it has a new problem: when I modify my Dockerfile, the old image cache will be loaded, new images will be added, and then everything will be stored in the cache. This will accumulate dead layers I will never need again, presumably until the CI provider's cache size limits are hit.

I think this can be fixed by caching all the stages of the build, but I am not sure how to reference the first stage. Is there a command I can run, similar to docker history -q -a, that will give me the hashes either for all non-last stages (since I can do the last one already) or for all stages including the last stage?

I was hoping docker build -q might do that, but it only prints the final hash, not all intermediate hashes.

Update

I have an inelegant solution, which does work, but there is surely a better way than this! I search the output of docker build for --->, which is Docker's way of announcing layer hashes and cache information. I strip out cache messages and arrows, leaving just the complete build layer hash list for all build stages:

docker build -t imagename . | grep '\-\-\->' | grep -v 'Using cache' | sed -e 's/[ >-]//g'

(I actually do the build twice: once for the CI build step proper, and a second time to gather the hashes. I could do it just once, but it feels nice to have the actual build in a separate step. The second build will always be fully cached, and only takes a few seconds to run.)
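Putting the two builds and the save together, the whole cache-save step might look like this (a sketch only; imagename and the cache path are the same placeholders as above):

# First build: the CI build step proper
docker build -t imagename .

# Second build: fully cached, just to harvest the layer hashes from all stages
LAYERS=$(docker build -t imagename . | grep '\-\-\->' | grep -v 'Using cache' | sed -e 's/[ >-]//g')

# Save every harvested layer so both stages can be restored next time
docker save ${LAYERS} | gzip > /caches/${CIRCLE_PROJECT_REPONAME}.tar.gz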

Can this be improved upon, perhaps using Docker commands?

  • Why not pull the last image from the registry and use `--cache-from` on your build command? – BMitch Apr 22 '18 at 13:12
  • Thanks, hmm. I thought I'd tried that @BMitch, but you've caused me to ponder whether I pulled it first. I think while working on this I've made the assumption that the first stage build is thrown away, and so one cannot refer to a resultant build image in order to obtain the cache for those throwaway layers. Is that not correct? – halfer Apr 22 '18 at 14:00
  • Oh, multi stage, that is thrown away if you don't explicitly tag and push it. You could push each stage separately with their own tags, build using the `--target` option, and while rebuilding do the same for each stage with separate `--from-cache` options for each build, building up to your final image. – BMitch Apr 22 '18 at 14:06
  • Ah thanks @BMitch, that `target` is what I was looking for when working on this! I wasn't sure non-final build stages were in any way addressable post-build. – halfer Apr 22 '18 at 14:15
  • Out of interest, what does `--from-cache` do? My work above loads and saves images based on the full layer list, and `docker build` doesn't seem to need this extra cache flag - it uses the cache fine on its own. – halfer Apr 22 '18 at 14:16
  • Sorry, `--cache-from`, typing on mobile from memory. That tells docker to trust the layers pulled from a remote repository. Without that, layers you pull won't be trusted by the build cache and it will rerun the build step if you haven't already performed it locally. – BMitch Apr 22 '18 at 14:26
  • @BMitch: thanks, that fills in the gaps for me. I've summarised this conversation in an answer below. Have I missed anything out? – halfer Apr 22 '18 at 14:33

1 Answer


This is a summary of a conversation in the comments.

One option is to push all build stages to a remote registry. If there are two build stages, the first one named build (via FROM ... AS build in the Dockerfile) and the second one unnamed, then one can do this:

docker build --target build --tag image-name-build .
docker build --tag image-name .

One can then push image-name (the final build artifact) and image-name-build (the first stage, which is normally thrown away) to a remote registry.
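For example (assuming image-name already includes whatever registry or user prefix your remote requires, e.g. myuser/image-name):

docker push image-name-build
docker push image-name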

When rebuilding images, one can pull both of these onto the fresh CI build machine, and then do:

docker build --cache-from image-name-build --target build --tag image-name-build .
docker build --cache-from image-name --tag image-name .
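The pulls that precede those two builds might look like this (the || true is a guard added here, not something Docker requires, so that a missing image - for example on the very first build - does not fail the CI step):

docker pull image-name-build || true
docker pull image-name || true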

As BMitch says, the --cache-from option tells Docker that the pulled images can be trusted and used as a local layer cache.

Comparison

The temporary solution in the question is good if you have a CI-native cache system to store files in, and you would rather not clutter up your registry with intermediate build stage images that are normally thrown away.

The --cache-from solution is nice because it is tidier, and uses Docker-native features rather than having to grep build output. It will also be very useful if your CI solution does not provide a file caching system, since it uses a remote registry instead.
