13

My service needs some large files (~100MB-500MB) when it is running. These files might change once in a while, and I don't mind rebuilding my container and re-deploying it when that happens.

I'm wondering what the best way is to store them and use them during the build, so anyone in the team can update the container and rebuild it.

My best idea so far is to store these large files in Git LFS, in a different branch for each version, so that I can add this to my Dockerfile:

RUN git clone -b 'version_2.0' --single-branch --depth 1 https://...git.git

This way, if these large files change, I just need to change the version_2.0 in the Dockerfile, and rebuild.

Is there any other recommended way? I also considered storing these files in Dropbox and just fetching them with a link using wget during the build.

P.S. - These large files are the weights for a deep network.

Edit - The question is: what is a reasonable way to store large files for a Docker image, such that one developer/team can change a file and the matching code, the change is documented (git), and the result can easily be used and even deployed by another team? (For this reason, just keeping the large files on a local PC is bad, because they need to be sent to another team.)

user972014
  • I would try rsync; rsync works fine over ssh and copies only the parts of the files that have changed, so it's very efficient, and you copy files only when you have to. – Marco Feb 01 '19 at 13:33
  • Instead of modifying the Dockerfile each time, you could use `docker build --build-arg VERSION="version_2.0" ...`. Your Dockerfile can obtain `$VERSION` as an environment variable thanks to the Dockerfile `ARG VERSION` instruction. I just need to mention that this kind of use is discouraged: it's not in the Docker philosophy (a specific Dockerfile should always build the same stack; it is not meant to be a provisioner such as Vagrant), and that instruction was originally motivated by the need to set up a proxy at build time. – arvymetal Feb 05 '19 at 03:31
  • I don't get exactly what you're asking for... Do you wonder what would be the best service to store large files? Or whether you should embed the large files inside the Docker container? Or what would be the best Docker pattern to version and provide these files? And by the way, would everybody in the team have access to the storage (Git LFS, Dropbox, etc.) or not? – arvymetal Feb 05 '19 at 03:57

7 Answers

10

These files might change once in a while, and I don't mind rebuilding my container and re-deploying it when that happens.

Then source control is not the best fit for such an artifact.

A binary artifact storage service, like Nexus or Artifactory (both of which have free editions and their own Docker images if you need one), is better suited to this task.

From there, your Dockerfile can fetch your large file(s) from Nexus/Artifactory.
See here for proper caching and cache invalidation.
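A rough sketch of how that could look in the Dockerfile; the repository URL, paths, and the ARG name are hypothetical, curl/tar are assumed to be available in the base image, and your artifact server may additionally require authentication:

ARG WEIGHTS_VERSION=2.0
# Hypothetical Nexus/Artifactory URL -- adjust to your own repository layout
RUN mkdir -p /opt/model && \
    curl -fSL "https://nexus.example.com/repository/models/weights-${WEIGHTS_VERSION}.tar.gz" \
    | tar -xz -C /opt/model

Rebuilding with a different --build-arg WEIGHTS_VERSION then only invalidates this layer and the ones after it.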

VonC
5

I feel that I must be misreading your question, because the answer seems blindingly obvious to me, but none of the other respondents are mentioning it. So please kindly forgive me if I am vastly misinterpreting your problem.

If your service needs large files when running and they change from time to time, then:

  • do not include them in the image; instead,
  • mount them as volumes (see the sketch below).
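For instance, a minimal sketch using a bind mount at run time (the host path and image name are hypothetical):

# Keep the weights on the host (or on shared storage mounted on the host)
# and bind-mount them read-only into the container:
docker run -d -v /data/model-weights:/opt/model:ro my-service:latest

Updating the weights then only means replacing the files on the host (or in a shared/named volume) and restarting the container; no image rebuild is needed.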
emory
  • The best answer hands-down. Maybe it would be worthwhile to update your answer with some information about `persisting volumes`, as seen, for example, in [kubernetes documentation](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)? It would allow containers to be up and running and to share/update the data between teams. – Szymon Maszke Feb 10 '19 at 03:00
  • @SzymonMaszke Thank you for your vote of confidence. I think it would be best if you wrote that answer, because I am not familiar with the persisting volumes concept, but perhaps I should be. – emory Feb 10 '19 at 08:47
  • Wrote up the basic idea as another answer. And no problem, this approach is much better than the current top answer IMO. – Szymon Maszke Feb 10 '19 at 14:03
4

It actually comes down to how you build your container. For example, we build our containers using Jenkins and the fabric8.io plugin as part of the Maven build, and we use ADD with a remote source URL (Nexus).

In general, you can use a URL as the source, so it depends on which storage you have access to:

  1. You can create an S3 bucket and provide access to your Docker builder node, then use something like ADD http://example.com/big.tar.xz /usr/src/things/ in your Dockerfile (a sketch of a private-bucket variant follows the list).

  2. You can upload the large files into an artifact repository (such as Nexus or Artifactory) and use that URL in ADD.

  3. If you're building with Jenkins, create a folder on the same host and configure the web server to serve that content with a virtual-host config, then use that URL.
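For option 1 with a private bucket (where a plain ADD of a public URL won't work), one possible sketch is to fetch the weights in a separate build stage. The bucket name, file names, and base image below are hypothetical, and this assumes the builder node already provides AWS credentials (for example via an instance role):

# Stage 1: pull the weights from a (hypothetical) private S3 bucket
FROM amazon/aws-cli:latest AS weights
RUN aws s3 cp s3://my-model-bucket/weights-2.0.tar.xz /tmp/weights.tar.xz

# Stage 2: the actual service image (base image is just an example)
FROM python:3.9-slim
COPY --from=weights /tmp/weights.tar.xz /usr/src/things/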

The optimal solution would be the one that is cheapest in terms of effort and cost without compromising on security.

Prasanna P
2

Volumes living as separate cluster resources and shared among teams

Just to complement @emory's answer, I would advise you to use Kubernetes' Persistent Volumes for your exact case.

How would those help?

As you said, there are multiple teams; each team may run a Pod, which is, in simple terms, a group of containers plus a specification of how they interact (starting, passing data, etc.). In other words, it's a logical connection between multiple containers. Such Pods usually run on a cluster and are managed by the Kubernetes engine.

Persistent Volumes are another resource in the cluster, containing data. In comparison to regular volumes, they reside in the cluster and can be accessed by different Pods through PersistentVolumeClaims.

Using this approach you can:

  • have zero down-time for your containers (Pods are replicated in the cluster as needed)
  • let anyone in the team (or a subset of it) update the weights of your network
  • fetch updated weights from the Pods without interfering with the containers

IMO this approach is more sustainable in the long term than merely rebuilding containers each time your data changes.
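As a rough illustration of the update workflow (all names are hypothetical; this assumes a PersistentVolumeClaim already exists and is mounted by the serving Pods at /models):

# Copy updated weights into any Pod that mounts the shared claim;
# other Pods mounting the same volume see the new files without an image rebuild
kubectl cp ./weights_v2.bin my-namespace/weights-admin-pod:/models/weights.bin

Note that sharing one volume across several Pods this way requires a storage backend that supports the ReadWriteMany access mode.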

Szymon Maszke
0

If you have a private Docker registry, you could build base images with those files already included. Then, in your service's Dockerfile, have a FROM instruction pointing to that base image.

Then, when other team members want to update, they just update the FROM instruction in the Dockerfile.

With this approach it is not relevant where you keep the original files, since they are only used once, when you build the base image.
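A minimal sketch of this pattern; the registry host, image names, tags, and paths are all hypothetical:

# Dockerfile for the base image, built and pushed once per weights version:
FROM python:3.9-slim
COPY weights_v2.0.bin /opt/model/weights.bin
# docker build -t registry.example.com/team/model-weights:2.0 . && docker push registry.example.com/team/model-weights:2.0

# Dockerfile for the service, referencing that tag:
FROM registry.example.com/team/model-weights:2.0
COPY . /app
CMD ["python", "/app/serve.py"]

Bumping the weights is then just a matter of pushing a new tag of the base image and updating the FROM line.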

Gonzalo Matheu
0

If you make sure that adding those files is the last (or one of the last) steps in building the image, the build can reuse the cache from previous versions. The only thing that will be rebuilt is the layer containing the big files (and any steps after that).
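A sketch of that layer ordering (file names and base image are hypothetical):

FROM python:3.9-slim
COPY requirements.txt /app/
RUN pip install -r /app/requirements.txt
COPY src/ /app/src/
# The big weights go last: when only they change, everything above stays cached
# and only this layer (plus any later steps) is rebuilt and re-downloaded
COPY weights/ /opt/model/
CMD ["python", "/app/src/serve.py"]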

Downloading that new image will also just download this last layer.

As for redeploying, you will need to make sure that all data (configs, tmp, ...) is stored in a volume. The "redeploy" can then use docker run ... --volumes-from=old-container ... and be instantly available again.

cfstras
0

If you are even considering Dropbox, why not consider AWS S3? Alternatively, you can mount the files from some volume or shared file system.

deosha