
I am developing an ETL process to be scheduled and orchestrated with Apache Airflow using the DockerOperator. I am working on a Windows laptop, so I can only run Apache Airflow from inside a Docker container. I was able to mount a folder with config files on my Windows laptop (called configs below) into the Airflow container (named webserver below) using a volume specified in the docker-compose.yml file in my project root directory. The relevant code from the docker-compose.yml file can be seen below:

version: '2.1'
services:
    webserver:
        build: ./docker-airflow
        restart: always
        privileged: true
        depends_on:
            - mongo
            - mongo-express
        environment:
            - LOAD_EX=n
            - EXECUTOR=Local
        volumes:
            - ./docker-airflow/dags:/usr/local/airflow/dags
            # Volume for source code
            - ./src:/src
            - ./docker-airflow/workdir:/home/workdir
            # configs folder as volume
            - ./configs:/configs
            # Mount the docker socket from the host (currently my laptop) into the webserver container so that the webserver container can create "sibling" containers
            - //var/run/docker.sock:/var/run/docker.sock  # the two "//" are needed on Windows
        ports:
            - 8081:8080
        command: webserver
        healthcheck:
            test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3
        networks:
            - mynet

Now I want to pass this configs folder with all its contents on to the containers created by the DockerOperator. Although the configs folder was apparently mounted into the webserver container's file system, the configs folder that ends up in the task's container is completely empty, and because of that my DAG fails. The code for the DockerOperator is as follows:

cmd = "--config_filepath {} --data_object_name {}".format("/configs/dev.ini", some_data_object)
staging_op = DockerOperator(
    command=cmd,
    task_id="my_task",
    image="{}/{}:{}".format(docker_hub_username, docker_hub_repo_name, image_name),
    api_version="auto",
    auto_remove=False,
    network_mode=docker_network,
    force_pull=True,
    volumes=["/configs:/configs"]  # "absolute_path_host:absolute_path_container"
)

According to the documentation, the left side of the volume must be an absolute path on the host, which (if I understood correctly) is the webserver container in this case (because it creates separate containers for every task). The right side of the volume is a directory inside the task's container which is created by the DockerOperator. As mentioned above, the configs folder inside the task's container does exist, but is completely empty. Does anyone know why this is the case and how to fix it?

Thank you very much for your help!

Kevin Südmersen

2 Answers


In this case the container started by the Airflow DockerOperator runs 'parallel' to the Airflow container, supervised by the Docker service on your host.
All volumes declared in the DockerOperator call must therefore be absolute paths on your host.
Volume definitions in docker-compose are somewhat special; there, relative paths are allowed.
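For illustration, here is a minimal sketch of what that means for the operator call, assuming a hypothetical absolute host path and image name (shown for a Linux host; see the comments below and the second answer for the Windows path form):

from airflow.operators.docker_operator import DockerOperator

# Sketch only: the left-hand side of each volume entry is an absolute path on the
# *host* (your laptop), not a path inside the airflow webserver container.
# "/home/kevin/my_project/configs" and the image name are hypothetical examples.
staging_op = DockerOperator(
    task_id="my_task",
    image="my_dockerhub_user/my_repo:latest",
    command="--config_filepath /configs/dev.ini --data_object_name my_object",
    api_version="auto",
    auto_remove=False,
    force_pull=True,
    volumes=["/home/kevin/my_project/configs:/configs"],
)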

sarnu
  • Thanks a lot for your answer! Relative paths would be awesome, because I would like to avoid paths which would only work on my Windows laptop, like `C:\Users\kevin...`. I will test it and then get back to you. – Kevin Südmersen Mar 10 '20 at 17:01
  • So, I finally had the chance to test what you suggested. In the docker operator I passed a list of volumes like this: `volumes=['C:\\Users\\kevin\\dev\\my_project\\data\\tmp:/data/tmp', 'C:\\Users\\kevin\\dev\\my_project\\data\\extracts:/data/extracts']`, but when airflow wanted to execute this operator, I got the error message: `500 Server Error: Internal Server Error ("invalid mode: /data/tmp")`. Any ideas how that might have happened? – Kevin Südmersen Aug 06 '20 at 15:17
  • Also, if I want to use relative paths, which directory would the path be relative to? – Kevin Südmersen Aug 06 '20 at 15:22
  • Or, is there a way to define the airflow container as the host of the container executing the task, i.e. docker-in-docker? – Kevin Südmersen Aug 06 '20 at 16:24
  • I have no experience running Docker under Windows and really am surprised you can map Windows paths like that to directories in a container. Regarding the 500 error I would suspect permission problems. Do a `docker exec -it <container> bash` and have a look at the permissions of the mounted directory. – sarnu Aug 07 '20 at 08:42
  • As for docker-in-docker: When I searched for solutions for running containers from Airflow running in a container, I came across the docker-in-docker concept. But it was regarded as a crazy concept and mounting the docker socket inside the container was seen as a better approach. – sarnu Aug 07 '20 at 08:48
  • Thanks for your help. After googling around a bit, it turned out that the Windows paths need to start with `/c/path/to/file` instead of `C:\\path\\to\\file` or `C:/path/to/file` – Kevin Südmersen Aug 07 '20 at 11:56

After implementing the suggestions from here, the volumes in the constructor of the DockerOperator need to be specified as follows:

cmd = "--config_filepath {} --data_object_name {}".format("/configs/dev.ini", some_data_object)
staging_op = DockerOperator(
    command=cmd,
    task_id="my_task",
    image="{}/{}:{}".format(docker_hub_username, docker_hub_repo_name, image_name),
    api_version="auto",
    auto_remove=False,
    network_mode=docker_network,
    force_pull=True,
    volumes=['/c/Users/kevin/dev/myproject/app/configs:/app/configs']  # "absolute_path_host:absolute_path_container"
)

Maybe the file paths need to look like that because Docker runs inside a VM on Windows?
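In case it helps others, below is a minimal sketch of a helper (my own, not part of Airflow or Docker) that converts a Windows path into this `/c/...` form, assuming that convention applies to your Docker installation:

from pathlib import PureWindowsPath

def to_docker_host_path(win_path):
    # Convert e.g. 'C:\\Users\\kevin\\dev\\myproject\\app\\configs'
    # into '/c/Users/kevin/dev/myproject/app/configs'.
    p = PureWindowsPath(win_path)
    drive = p.drive.rstrip(":").lower()  # 'C:' -> 'c'
    rest = "/".join(p.parts[1:])         # path components after the drive
    return "/{}/{}".format(drive, rest)

# Example usage (hypothetical path):
host_configs = to_docker_host_path(r"C:\Users\kevin\dev\myproject\app\configs")
volumes = ["{}:/app/configs".format(host_configs)]  # -> '/c/Users/kevin/dev/myproject/app/configs:/app/configs'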

As @sarnu also mentioned, it is important to understand that the host-side paths are paths on my Windows laptop, because the containers created for each task run in parallel with / are sibling containers to the Airflow container.

Kevin Südmersen