How can I always pull my latest docker image but still deterministically record its composition for future reproducibility?

Question

I'm doing analytical work inside a "Lab" Docker environment that I manage. I use Travis to build, tag and publish the lab image to a Docker container registry (AWS ECR), and I always pull the latest image when I start the container to do my analytical work. This ensures I'm always working inside the latest version of the Lab environment. Note: each time Travis publishes a new image, it tags it in ECR with both the build's git commit ID and latest.

For reproducibility of my analytical results, I would like my Python code running inside the container to be able to record in its outputs an identifier that indicates the exact Docker image being used. This would enable me to re-download that particular Docker image many months or years later from ECR and/or find the git commit from which the image was built, run the code again, and (hopefully!) get the same results.

What is the most standard way of achieving this? Can I perhaps store the image digest as an environment variable inside the container?

score 2 · Accepted Answer · answered May 01 '20 at 15:44

There are a couple of options, depending on how the image is built.

Assuming the source code is cloned in CI and the image is built from that source (so you're not cloning the source code in the Dockerfile), you can use a build-arg to "bake" the commit into the image as an environment variable.

In your Dockerfile, define a build-arg (ARG) and assign its value to an environment variable (ENV). The assignment to an ENV is needed because build-args are, by design, not persisted in the image itself; they are only available during the build.

For example:

FROM busybox:latest
ARG GIT_COMMIT=HEAD
ENV GIT_COMMIT=${GIT_COMMIT}

I'm setting a default value so that the variable contains something "useful" if the Dockerfile is built without passing a build-arg.

Then, when building the image, pass the git commit as a build-arg:

git clone https://github.com/me/my-repo.git && cd my-repo

export GIT_COMMIT=$(git rev-parse --short --verify HEAD)

docker build -t lab:${GIT_COMMIT} --build-arg GIT_COMMIT=${GIT_COMMIT} .

When running the image, GIT_COMMIT is available as an environment variable.
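
Inside the container, the Python code can then read that variable and stamp it onto whatever it writes out. A minimal sketch, assuming the variable is named GIT_COMMIT as in the Dockerfile above; the helper name `environment_stamp` and the field names are hypothetical, not part of any library:

```python
import json
import os
from datetime import datetime, timezone


def environment_stamp(environ=os.environ):
    """Collect identifiers for the running image from environment variables."""
    return {
        # Baked in at build time via ENV GIT_COMMIT; "unknown" if it is missing.
        "git_commit": environ.get("GIT_COMMIT", "unknown"),
        # When the stamp was taken, in UTC.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Print the stamp; in real use it would be merged into the results file.
    print(json.dumps(environment_stamp(), indent=2))
```

Passing the environment in as a parameter keeps the function easy to test outside a container.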

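If you would rather pass a reference at runtime (when running the image), you can set it as an environment variable on `docker run`; for example, to pass the digest of the image that you're running. Note that RepoDigests is only populated for images that were pulled from or pushed to a registry: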

docker pull lab:latest

export IMAGE_DIGEST=$(docker inspect --format '{{ (index .RepoDigests 0) }}' lab:latest)

docker run -it --rm -e IMAGE_DIGEST=${IMAGE_DIGEST} lab:latest
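
On the Python side, the digest can be written next to each result so that it can later be matched to an image in the registry. A rough sketch, assuming IMAGE_DIGEST holds a `repository@algorithm:hex` reference as produced by the `docker inspect` command above; `split_digest_reference`, `write_provenance` and the `.provenance.json` suffix are made-up names for illustration:

```python
import json
import os
from pathlib import Path


def split_digest_reference(reference):
    """Split 'repository@sha256:abc...' into (repository, digest)."""
    repository, sep, digest = reference.partition("@")
    # A digest reference has an '@' and a digest of the form 'algorithm:hex'.
    if not sep or ":" not in digest:
        raise ValueError(f"not a digest reference: {reference!r}")
    return repository, digest


def write_provenance(result_path, environ=os.environ):
    """Write a JSON sidecar file recording which image produced result_path."""
    repository, digest = split_digest_reference(environ.get("IMAGE_DIGEST", ""))
    sidecar = Path(str(result_path) + ".provenance.json")
    sidecar.write_text(json.dumps(
        {
            "result": Path(result_path).name,
            "image_repository": repository,
            "image_digest": digest,
        },
        indent=2,
    ))
    return sidecar
```

With the digest recorded, the same image can later be pulled by digest rather than by tag, in the form `docker pull <repository>@<digest>`.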

Thanks! That's really helpful. I tried your first suggestion myself but didn't like the fact that it creates a new image (and pushes it to ECR) with every git push of the repo, even if the Dockerfile and context haven't changed. I think the runtime option you suggest may be a better option for me. – FOXintheBOX May 01 '20 at 15:56
Are the analytical results generated when running the image or when building the image? And how are the results published? Instead of performing the analysis in a `docker run`, have you considered putting the steps in a `Dockerfile` and performing them in a `docker build`? (I obviously don't have a lot of context, so I'm just thinking out loud.) – thaJeztah May 01 '20 at 16:14
They're generated when _running_ the image. This container is my analytical work environment. I start it up with an interactive shell, do exploratory (Data Science) work in Python from inside of it, generate some results that I save to a mounted volume, then kill the container, go home, and put my feet up for the day. Then I wake up at night realising I can't guarantee the reproduction of those results because they don't include info about the computational environment used to create them. – FOXintheBOX May 01 '20 at 16:50

score 0 · Answer 2 · answered May 01 '20 at 15:07

Append the commit ID to your image tag.

For example: imagename:dev-v1-bc4da47

where bc4da47 is the last commit ID.

You can get the last commit ID with:

git rev-parse --short HEAD
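
The same tagging convention can be scripted, and the commit ID can be recovered from a tag later. A small sketch in Python, assuming the tag always ends in `-<short commit>` as in the example above; `short_commit`, `image_tag` and `commit_from_tag` are hypothetical helper names:

```python
import re
import subprocess


def short_commit(repo_dir="."):
    """Return the short commit ID of HEAD (requires git and a repository)."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], cwd=repo_dir, text=True
    ).strip()


def image_tag(name, prefix, commit):
    """Build a reference such as 'imagename:dev-v1-bc4da47'."""
    return f"{name}:{prefix}-{commit}"


def commit_from_tag(image_ref):
    """Extract the trailing commit ID from a tagged reference, or None."""
    tag = image_ref.rsplit(":", 1)[-1]
    match = re.search(r"-([0-9a-f]{7,40})$", tag)
    return match.group(1) if match else None
```

For example, `image_tag("imagename", "dev-v1", short_commit())` would build the tag from the current checkout.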

WSMathias9


Thanks for the suggestion! I already do that, as I mentioned in the "Note". However, that doesn't mean I can see that commit ID when I `docker pull lab:latest` or make it available inside the container for writing out with my results. – FOXintheBOX May 01 '20 at 15:15

score 0 · Answer 3 · answered May 01 '20 at 15:43

When you build the image, pass in a build argument with the git hash:

$ docker build --build-arg GIT_HASH=$(git rev-parse --short HEAD) -t yourimage .

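And in your Dockerfile you should have:

ARG GIT_HASH
ENV GIT_HASH=${GIT_HASH}

The ENV line is needed because an ARG on its own is only available during the build and is not persisted in the image. With it, code running inside the resulting container has an environment variable containing the git hash.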

Long version: https://pythonspeed.com/articles/identifying-images/

Itamar Turner-Trauring


Thanks for this. I read the article you linked to and it led me to try adding `--label git-commit=git-commit` (rather than using `--build-arg`). The main downside of either approach (for me) is that every time my CI builds the repo (which contains other code, not just the docker image), the `label` or `build-arg` changes even if nothing else has, leading to the creation and publishing of new (otherwise identical) images in ECR. – FOXintheBOX May 04 '20 at 09:08
