Tasks in our DAGs are failing regularly, and after following Google's troubleshooting steps I've identified the underlying cause as pod evictions triggered by insufficient node memory.
This matches what I'm seeing in the Memory utilization per node graph in the Composer Monitoring tab: our machine type provides 8 GB of memory per node, yet the largest spikes reach 16 GB.
[Screenshot: Memory utilization per node graph, showing memory spikes]
Where I'm stuck is identifying which DAGs are causing the memory spikes. (My working assumption is that one DAG, "DAG A", causes a spike that then gets a task from "DAG B" evicted.) I'd like to revisit that DAG's code to see whether it can be optimized before resorting to a larger machine size.
How do I connect the dots to understand which tasks were being handled by a given Kubernetes Node at a given time?
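For reference, the closest I've gotten is querying the Airflow metadata database directly, since (as far as I understand the standard Airflow schema, which Composer uses) the `task_instance` table records which worker pod (`hostname`) ran each task and when. A rough sketch of what I mean; the connection string and spike window are placeholders, not real values:

```python
# Sketch: list task instances whose runtime overlapped a memory-spike window.
# Assumes read access to the Composer environment's Airflow metadata DB;
# the connection string below is a placeholder, not a real endpoint.
from datetime import datetime, timezone

import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql+psycopg2://airflow:PASSWORD@METADATA_DB_HOST/airflow"  # placeholder
)

# Spike window read off the Monitoring graph (placeholder values).
spike_start = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
spike_end = datetime(2023, 1, 1, 12, 30, tzinfo=timezone.utc)

# hostname is the Airflow worker pod that executed the task
# (e.g. an airflow-worker-* pod on Composer).
query = sqlalchemy.text(
    """
    SELECT dag_id, task_id, hostname, start_date, end_date
    FROM task_instance
    WHERE start_date <= :spike_end
      AND (end_date >= :spike_start OR end_date IS NULL)
    ORDER BY start_date
    """
)

with engine.connect() as conn:
    rows = conn.execute(query, {"spike_start": spike_start, "spike_end": spike_end})
    for row in rows:
        print(row.dag_id, row.task_id, row.hostname, row.start_date, row.end_date)
```

That gives me task-to-pod, but not pod-to-node: `kubectl get pods -o wide` shows which node each pod is currently on, but the historical pod-to-node placement at the time of a past spike is where the dots stop connecting. Is there a built-in way (Monitoring, GKE logs, or Airflow itself) to make that mapping?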