How do I identify which DAG / task is causing Node memory issues?

Question

Tasks are regularly failing in our DAGs, and after following Google's troubleshooting steps I've identified the underlying cause to be memory evictions due to insufficient memory.

This matches what I'm seeing in the Memory utilization per node graph in the Composer Monitoring tab. Our machine supports 8 GB nodes and the largest spikes are 16 GB.

Screen shot of memory utilization per node graph, which shows memory spikes

Where I'm stuck is identifying which DAGs are causing the memory spikes. (My assumption is that "DAG A" may be causing the memory spike which led to "DAG B" being evicted). I'd like to revisit the code to see if it can be optimized before increasing the machine size.

How do I connect the dots to understand which tasks were being handled by a given Kubernetes Node at a given time?

score 3 · Accepted Answer · answered May 29 '20 at 10:34

If you need to know which DAG is running in which instance in a given time, one approach is using Airflow Metadata Database. For that, follow the steps below:

Go to GKE page and click on your cluster.
In your cluster's page, click on the Connect button on the top of the page (between Deploy and Duplicate)
A pop-up window will open with two options. Click on Run in Cloud Shell. After that, a Cloud Shell instance will be opened with a command like below. Just press enter to submit this command.
```
gcloud container clusters get-credentials <cluter_id> --zone <cluster_zone> --project <project_id>`
```

Then run the command below to deploy a temporary MySQL image

kubectl run mysql-cli-tmp-deployment \
--generator=run-pod/v1 \
--rm --stdin --tty \
--image mysql:latest \
-- \
  bash

Run the command below to connect to your MySQL instance

mysql \
--user root \
--host airflow-sqlproxy-service.default

Now you can use SHOW DATABASES; to determine what database we need. If you're using new versions of Composer, the database name will be something like composer-< some_version >-airflow-< some_version >-< some_hash >. For old versions, the name will be airflow-db. After determining your database, just enter the command use <your_database>;
Now if you run SHOW TABLES: you will find a lot of different information. For your purpose, I suggest that you run select * from job limit 10; By running this query you will see that this table has many information about the tasks: DAG id, start date, end date ,hostname(there is one hostname for each node in your cluster) [...]
Finally, as a last step I would like to explain how can you determine which is hostname is associated with which node.
1. Go to GKE page and click on Workloads
2. Click on airflow-worker
3. Scroll down to the Managed pods section and you will find the hostnames. If you click on any of them, you will be able to see its node in the next page.

With these information you will be able to run queries to determine which DAG is running in which instance in a given time.

Follow up question - the Composer Monitoring tab lists nodes as an 18 digit numerical ID. Is there a way to see the relationship between that numerical ID and hostname? — Jason Stinnett, May 29 '20 at 14:57
@Jason, in the last step I suggested a way of doing that. Is it possible for you? — rmesteves, May 29 '20 at 15:16
Found it now. The particular ID I was looking for is found by following your instructions in Step 8, then clicking the link to the node, clicking on the details tab, and looking at Annotations - container.googleapis.com/instance_id: — Jason Stinnett, May 29 '20 at 16:06

How do I identify which DAG / task is causing Node memory issues?

1 Answers1