
I have had this problem for several weeks now, so I wanted to know if someone else has encountered the same issue: I have a pipeline orchestrated by Cloud Composer (Google Cloud Platform's managed version of Apache Airflow). I have some tasks coded in PySpark that need to be executed on a Cloud Dataproc cluster.

To do so, I use the DataprocPySparkOperator. It works perfectly fine for most tasks. However, as soon as a task's execution exceeds one hour, it is almost always marked as "failed" in the Airflow UI, even though the job is still running fine on Dataproc.
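For reference, the task is declared roughly as follows. This is only a minimal sketch: the bucket path, cluster name and region are placeholders, and the exact import path and class name depend on the Airflow version bundled with the Composer image (in the contrib package the class is spelled DataProcPySparkOperator).

    # Minimal, illustrative DAG snippet (placeholder names throughout).
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

    default_args = {
        "start_date": datetime(2019, 9, 1),
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG("my_pipeline", default_args=default_args, schedule_interval="@daily") as dag:
        long_pyspark_task = DataProcPySparkOperator(
            task_id="long_pyspark_task",
            main="gs://my-bucket/jobs/long_job.py",  # PySpark driver script on GCS
            cluster_name="my-dataproc-cluster",      # existing Dataproc cluster
            region="europe-west1",
        )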

I have done a fair amount of research on the Internet, and read about the visibility timeout of Celery (the mandatory executor for Cloud Composer) here. I also read that there previously was a similar issue with the Kubernetes Operator due to the Kubernetes client itself (for instance here).

However, I have not found any post dealing with a timeout issue related to the DataprocPySparkOperator. I have tried changing the Composer image version and setting the visibility timeout to 10 hours in the Airflow configuration overrides, with mixed results.

EDIT: It seems that changing the Celery visibility timeout in the Airflow configuration (see screenshot) has solved the issue, although I need to wait for a few more runs to confirm it is actually fixed. I would still like to understand why this issue appeared only with Dataproc, and how this visibility timeout affects the reliability of Airflow.
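Concretely, the override corresponds to the following Airflow configuration (10 hours = 36000 seconds):

    [celery_broker_transport_options]
    visibility_timeout = 36000

In Cloud Composer this can be applied as an Airflow configuration override, for example with something like the command below (environment name and location are placeholders):

    gcloud composer environments update my-environment \
        --location europe-west1 \
        --update-airflow-configs=celery_broker_transport_options-visibility_timeout=36000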

NJDV
  • Can you add logs please? That will allow us to help you better. – kaxil Sep 12 '19 at 17:02
  • I don't have logs saved right now; I'll try to get them if I encounter the issue again. – NJDV Sep 13 '19 at 10:13
  • I am using Airflow (Cloud Composer) with long-running (3-4 h) PySpark jobs without any issues. The DAG has been running daily for months. I suggest you upgrade your Composer version to the latest and avoid changing airflow.cfg. – Alan Borsato Dec 21 '19 at 22:49

0 Answers