
I am using the https://github.com/puckel/docker-airflow image to run Airflow. I had to `pip install docker` for it to support the DockerOperator.

Everything seems ok, but I can't figure out how to pull an image from a private google docker container repository.

I tried adding a connection of type Google Cloud in the Admin section and running the DockerOperator as:

    t2 = DockerOperator(
            task_id='docker_command',
            image='eu.gcr.io/project/image',
            api_version='2.3',
            auto_remove=True,
            command="/bin/sleep 30",
            docker_url="unix://var/run/docker.sock",
            network_mode="bridge",
            docker_conn_id="google_con"
    )

But I always get an error:

[2019-11-05 14:12:51,162] {{taskinstance.py:1047}} ERROR - No Docker registry URL provided

I also tried the `dockercfg_path` option:

    t2 = DockerOperator(
            task_id='docker_command',
            image='eu.gcr.io/project/image',
            api_version='2.3',
            auto_remove=True,
            command="/bin/sleep 30",
            docker_url="unix://var/run/docker.sock",
            network_mode="bridge",
            dockercfg_path="/usr/local/airflow/config.json",
    )

I get the following error:

[2019-11-06 13:59:40,522] {{docker_operator.py:194}} INFO - Starting docker container from image eu.gcr.io/project/image
[2019-11-06 13:59:40,524] {{taskinstance.py:1047}} ERROR - ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

I also tried using only `dockercfg_path="config.json"` and got the same error.

I can't really use a BashOperator to run docker login either, as the worker does not recognize the docker command:

    t3 = BashOperator(
            task_id='print_hello',
            bash_command='docker login -u _json_key -p /usr/local/airflow/config.json eu.gcr.io'
    )

line 1: docker: command not found

What am I missing?
Tomaž Bratanič
  • Did you enable the Google Container Registry API so you can push and pull images? Make sure to attach a Storage Admin role to the service account if you plan to pull and push Docker images. You can view Permissions and Roles for GCR in the Google Cloud Platform documentation. – redhatvicky Nov 13 '19 at 15:15
  • Also try to install docker-py==1.10.6 in the PyPI section of composer. – redhatvicky Nov 13 '19 at 15:46
  • A quick workaround is to use the gcloud SDK to authenticate Docker. Something like `gcloud auth configure-docker -q && docker pull $(IMAGE_NAME)` should probably work. – redhatvicky Nov 14 '19 at 13:22
  • You can also follow the below steps to crack this one: 1. Export current deployment config to file `kubectl get deployment airflow-worker -o yaml --export > airflow-worker-config.yaml` 2. Edit airflow-worker-config.yaml (example link) to mount docker.sock and docker, grant privileged access to airflow-worker to run docker commands 3. Apply deployment settings `kubectl apply -f airflow-worker-config.yaml` Source : https://groups.google.com/forum/?hl=zh-CN#!topic/cloud-composer-discuss/pSPKFS7AOj0 – redhatvicky Nov 14 '19 at 13:23

5 Answers


airflow.hooks.docker_hook.DockerHook uses the docker_default connection when one isn't specified.

Now in your first attempt, you set google_con for docker_conn_id, and the error thrown shows that the host (i.e. the registry name) isn't configured.
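
For context, the hook essentially performs a docker login against the daemon using the connection fields, so the connection's Host is what supplies the missing registry URL. A rough sketch of that behaviour (paraphrased for illustration, not the exact Airflow source; the registry value is illustrative):

    import docker

    # Roughly what DockerHook does with the connection fields (illustrative):
    client = docker.APIClient(base_url="unix://var/run/docker.sock")
    client.login(
        username="_json_key",               # Connection.login
        password="<contents of key file>",  # Connection.password
        registry="eu.gcr.io/project",       # Connection.host -- the "registry URL" the error asks for
    )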

Here are a couple of changes to do:

  • The image argument passed to DockerOperator should be set to the image name and tag without the registry name prefixing it.
DockerOperator(
    api_version='1.21',
    # docker_url='tcp://localhost:2375', # Set your docker URL
    command='/bin/ls',
    image='image',
    network_mode='bridge',
    task_id='docker_op_tester',
    docker_conn_id='google_con',
    dag=dag,
    # added these to map to the host path on macOS
    host_tmp_dir='/tmp',
    tmp_dir='/tmp',
)
  • Provide the registry name, username, and password in your google_con connection for the underlying DockerHook to authenticate to Docker.

You can obtain long-lived credentials for authentication from a service account key. For the username, use _json_key, and in the password field paste the contents of the JSON key file.

[Screenshot: the google_con connection in the Airflow UI, with the registry host, _json_key login, and the JSON key contents as the password]
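
If you prefer not to click through the Admin UI, the same connection can be created programmatically. A minimal sketch, assuming Airflow 1.10.x and a service account key saved locally; the file path and project name are illustrative:

    from airflow import settings
    from airflow.models import Connection

    # Read the service account key; its full contents become the password.
    with open("keyfile.json") as f:  # hypothetical path to the key file
        key_contents = f.read()

    conn = Connection(
        conn_id="google_con",
        conn_type="docker",
        host="eu.gcr.io/project",  # registry host, no image name
        login="_json_key",         # literal username for JSON-key auth
        password=key_contents,
    )

    session = settings.Session()
    session.add(conn)
    session.commit()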

Here are logs from running my task:

[2019-11-16 20:20:46,874] {base_task_runner.py:110} INFO - Job 443: Subtask docker_op_tester [2019-11-16 20:20:46,874] {dagbag.py:88} INFO - Filling up the DagBag from /Users/r7/OSS/airflow/airflow/example_dags/example_docker_operator.py
[2019-11-16 20:20:47,054] {base_task_runner.py:110} INFO - Job 443: Subtask docker_op_tester [2019-11-16 20:20:47,054] {cli.py:592} INFO - Running <TaskInstance: docker_sample.docker_op_tester 2019-11-14T00:00:00+00:00 [running]> on host 1.0.0.127.in-addr.arpa
[2019-11-16 20:20:47,074] {logging_mixin.py:89} INFO - [2019-11-16 20:20:47,074] {local_task_job.py:120} WARNING - Time since last heartbeat(0.01 s) < heartrate(5.0 s), sleeping for 4.989537 s
[2019-11-16 20:20:47,088] {logging_mixin.py:89} INFO - [2019-11-16 20:20:47,088] {base_hook.py:89} INFO - Using connection to: id: google_con. Host: gcr.io/<redacted-project-id>, Port: None, Schema: , Login: _json_key, Password: XXXXXXXX, extra: {}
[2019-11-16 20:20:48,404] {docker_operator.py:209} INFO - Starting docker container from image alpine
[2019-11-16 20:20:52,066] {logging_mixin.py:89} INFO - [2019-11-16 20:20:52,066] {local_task_job.py:99} INFO - Task exited with return code 0
Oluwafemi Sule
  • When I try to paste the contents of my JSON key into the password box, I get a MySQL database error: `Failed to create record. (_mysql_exceptions.DataError) (1406, "Data too long for column 'password' at row 1")`. Did you run into this issue as well? – Evan Kaeding Jun 29 '20 at 18:54
  • No I didn't experience that. What version of Airflow are you running into this issue on? – Oluwafemi Sule Jun 29 '20 at 19:12
  • Good callout. As of Airflow 1.10.6, the password length was restricted to 500 characters, too long for a Google JSON key file. In version 1.10.7, the password length was increased to 5,000 characters. [Source here](https://github.com/apache/airflow/issues/8417). – Evan Kaeding Jun 29 '20 at 19:15

I know the question is about GCR but it's worth noting that other container registries may expect the config in a different format.

For example, GitLab expects you to pass the fully qualified image name to the DAG and put only the GitLab container registry host name in the connection:

DockerOperator(
    task_id='docker_command',
    image='registry.gitlab.com/group/project/image:tag',
    api_version='auto',
    docker_conn_id='gitlab_registry',
)

Then set up your gitlab_registry connection like:

docker://gitlab+deploy-token-1234:ABDCtoken1234@registry.gitlab.com
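
Alternatively, Airflow can pick the connection up from an environment variable named AIRFLOW_CONN_<CONN_ID> holding the same URI, which skips the UI entirely. A minimal sketch (the token is the placeholder from above; setting it from Python is just for illustration, in practice you would export the variable in the scheduler and worker environment):

    import os

    # Equivalent to creating the gitlab_registry connection in the UI.
    os.environ["AIRFLOW_CONN_GITLAB_REGISTRY"] = (
        "docker://gitlab+deploy-token-1234:ABDCtoken1234@registry.gitlab.com"
    )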
Tamlyn
  • How can we access the deploy token from our GitLab account? – Mobin Al Hassan Mar 16 '21 at 17:58
  • This saved me a lot of looking around, though I used docker login on the machine instead of `docker_conn_id` – imsheth Mar 31 '21 at 11:19
  • Refer to https://docs.gitlab.com/ee/user/packages/container_registry/ for full info, and you can get the deploy token created via https://gitlab.com///-/settings/repository. Hope this helps @mobinalhassan – imsheth Mar 31 '21 at 12:25
  • @imsheth I did not get how to create a connection... can you explain this? I've been having this problem for too long – Mobin Al Hassan May 02 '21 at 17:50
  • I'm getting the error `raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.41/auth`; my URI is docker://gitlab+68hzvkEtonsazVoxNHsy:mobin@registry.gitlab.com – Mobin Al Hassan May 03 '21 at 05:27
  • @mobinalhassan posted as an answer below at https://stackoverflow.com/a/67372027/3152654, hope this helps – imsheth May 03 '21 at 15:57
  • @imsheth How can I do this inside Docker? Actually, I'm using the Docker installation of Airflow 2.0 and I want to pull an image from GitLab – Mobin Al Hassan May 04 '21 at 05:58
  • I also asked a new question; if you can answer with more detail I'll be thankful: https://stackoverflow.com/questions/67359652/how-to-pull-private-docker-image-from-gitlab-container-registry-using-dockeroper – Mobin Al Hassan May 04 '21 at 06:00
  • @mobinalhassan the issue with dockerized Airflow, on Airflow 1.10.x and 2.0.x, is mostly a permissions issue, which I spent 2+ months on but couldn't get through. I was stuck because the base image for dockerized Airflow wasn't Ubuntu or Alpine, and in the interest of time I switched to non-dockerized Airflow to finally make it work – imsheth May 04 '21 at 08:40
  • @imsheth Yes, you describe the same problem I'm having. Is it a good idea to use non-dockerized Airflow on EC2? – Mobin Al Hassan May 04 '21 at 09:14
  • @MobinAlHassan this is how I managed to make it run, and I've finally jotted it down: https://imsheth.com/posts/tags/tech/airflow2-dockeroperator-nodejs-gitlab – imsheth Aug 06 '21 at 09:58

Based on recent Cloud Composer documentation, it's recommended to use the KubernetesPodOperator instead, like this:

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

KubernetesPodOperator(
    task_id='docker_op_tester',
    name='docker_op_tester',
    dag=dag,
    namespace="default",
    image="eu.gcr.io/project/image",
    cmds=["ls"]
    )
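
If the cluster's node service account can't pull from your private registry, the operator can also reference a Kubernetes image pull secret. A minimal sketch, assuming a kubernetes.io/dockerconfigjson secret named gcr-pull-secret already exists in the namespace (the secret name is illustrative):

    from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

    KubernetesPodOperator(
        task_id='docker_op_tester_private',
        name='docker_op_tester_private',
        dag=dag,
        namespace="default",
        image="eu.gcr.io/project/image",
        cmds=["ls"],
        # Name of an existing registry-credentials secret (illustrative).
        image_pull_secrets="gcr-pull-secret",
    )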
Leopoldo Varela

Further to @Tamlyn's answer, we can also skip creating a connection (docker_conn_id) in Airflow and use it with GitLab as follows:

  1. On your development machine:
  • https://gitlab.com/yourgroup/yourproject/-/settings/repository (create a token here and get the details for logging in)
  • docker login registry.gitlab.com (logs in to Docker so you can push the image; enter your GitLab credentials when prompted)
  • docker build -t registry.gitlab.com/yourgroup/yourproject . && docker push registry.gitlab.com/yourgroup/yourproject (builds and pushes the image to your project repo's container registry)
  2. On your Airflow machine:
  • https://gitlab.com/yourgroup/yourproject/-/settings/repository (you can use the token created above for logging in)
  • docker login registry.gitlab.com (logs in to Docker so you can pull the image; this skips the need for creating a Docker registry connection. Enter your GitLab credentials when prompted; this generates the ~/.docker/config.json that is required, per the Docker docs)
  3. In your DAG:
dag = DAG(
    "dag_id",
    default_args = default_args,
    schedule_interval = "15 1 * * *"
)

docker_trigger = DockerOperator(
    task_id = "task_id",
    api_version = "auto",
    network_mode = "bridge",
    image = "registry.gitlab.com/yourgroup/yourproject",
    auto_remove = True, # use if required
    force_pull = True, # use if required
    xcom_all = True, # use if required
    # tty = True, # turning this on screws up the log rendering
    # command = "", # use if required
    environment = { # use if required
        "envvar1": "envvar1value",
        "envvar2": "envvar2value",
    },
    dag = dag,
)

This works on Ubuntu 20.04.2 LTS (tried and tested) with Airflow installed directly on the instance.

imsheth
  • I believe I got this to work with DigitalOcean's container registry. Quick question though: when you pull the image using the DockerOperator, does it store the image on the Airflow machine, or does it remove it? It appears that the auto_remove flag is only for the container – dyao Apr 27 '22 at 20:52
  • `auto_remove` is for garbage collection I believe, `force_pull` pulls/tries to pull a new image every time, if available – imsheth Apr 28 '22 at 12:36

You will need to install the Cloud SDK on your workstation, which includes the gcloud command-line tool.

After installing the Cloud SDK and Docker version 18.03 or newer, per their documentation, pull from Container Registry with:

docker pull [HOSTNAME]/[PROJECT-ID]/[IMAGE]:[TAG] 

or

docker pull [HOSTNAME]/[PROJECT-ID]/[IMAGE]@[IMAGE_DIGEST]

where:

  • [HOSTNAME] is listed under Location in the console. It's one of four options: gcr.io, us.gcr.io, eu.gcr.io, or asia.gcr.io.
  • [PROJECT-ID] is your Google Cloud Platform Console project ID.
  • [IMAGE] is the image's name in Container Registry.
  • [TAG] is the tag applied to the image. In a registry, tags are unique to an image.
  • [IMAGE_DIGEST] is the sha256 hash value of the image contents. In the console, click on the specific image to see its metadata. The digest is listed as the Image digest.

To get the pull command for a specific image:

  1. Click on the name of an image to go to the specific registry.
  2. In the registry, check the box next to the version of the image that you want to pull.
  3. Click SHOW PULL COMMAND on the top of the page.
  4. Copy the pull command, which identifies the image using either the tag or the digest

*Also check that you have push and pull permissions for the registry.

**Configure Docker to use gcloud as a credential helper, or use another authentication method. To use gcloud as the credential helper, run the command:

gcloud auth configure-docker
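
Once the credential helper is configured, you can sanity-check that pulls work outside of Airflow using the Docker SDK for Python (the same `pip install docker` package the question already needed). A minimal sketch; the image name is illustrative:

    import docker

    # Uses the local Docker daemon and the auth configured in ~/.docker/config.json.
    client = docker.from_env()
    image = client.images.pull("eu.gcr.io/project/image:latest")
    print(image.id)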
Ernesto U