tl;dr Apache Beam pipeline step involes building docker image; How to run this pipeline using Google Dataflow? What alternatives exist?
I'm currently trying make my first steps with google's dataflow service and apache beam (python).
Trivial examples are pretty straight forward but things get confusing to me as soon as external software dependencies come into play. It seems to be possible to use custom docker containers to setup ones own environment [1][2]. While that's great for most dependencies, it doesn't help, if the dependency is docker itself, as it happens to be the case for me: One step of my pipeline involves using an external project which makes heavy use of docker (i. e. building images, running them)
As far as I can tell there are three options to tackle that problem:
- Docker within Docker Run the external project's scripts which build docker images within a docker container running on a dataflow worker node. While building docker image within docker is possible in principle [3] I've got the feeling that won't work in this case, since there is only very limited control over the environment.
- Custom VM image for worker nodes Is it possible to use custom vm images for dataflow worker nodes?
- Don't use Google Dataflow What are better suited alternative services?
Thanks!
[1] Custom VM images for Google Cloud Dataflow workers
[2] https://cloud.google.com/dataflow/docs/guides/using-custom-containers
[3] https://www.docker.com/blog/docker-can-now-run-within-docker/
Edit: Added line breaks.