
I have several Data Fusion pipelines that all perform the same basic tasks: insert data into a BigQuery table, load it into S3, and then truncate the BigQuery table. Everything looks fine until I get the 'pipeline xxx succeed' log, but then the run enters a really long loop of:

Failed to fetch monitoring messages for program program_run:default.xxx.-SNAPSHOT.workflow.DataPipelineWorkflow.yyy

and at the end it just gets stuck on the following error:

Failed to monitor the remote process and exhausted retries. Terminating the program program_run

I tried to abort the entire run using the stop button and by stopping the DataPipelineWorkflow, but nothing seems to change.

How can I stop such a run, or even avoid the "Failed to fetch monitoring messages" phase altogether?

aga
EyalMech
1 Answer


Since there are not many logs to debug from, this issue might be related to a lineage computation bug that fails in certain cases.

There is a bug (https://issues.cask.co/browse/CDAP-16356) that causes the lineage computation to get out of hand for certain pipelines. This usually manifests as a pipeline that stays in the running state forever, rather than as a failed pipeline. Is that the behavior you are seeing, or is the run actually dying and going into the failed state?

If it's dying, it could be running out of memory, in which case you can try increasing the driver memory. You can do this from the pipeline detail page -> configure -> resources -> driver memory.

If it's stuck, you will have to delete the Dataproc cluster manually; you can find the cluster name at the start of the logs. Unfortunately, there is not much you can do to make the lineage computation run faster until the upcoming 6.1.2 release. The only option is to restructure the pipeline to reduce the lineage computation. We have seen that Wrangler nodes and Spark nodes tend to exacerbate these issues, so restructuring usually involves combining these types of nodes when possible.
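As a sketch of the manual cleanup, the orphaned cluster can be deleted with the gcloud CLI. The cluster name, region, pipeline name, run ID, and CDAP endpoint below are all placeholders you must substitute from your own logs and instance; the stop call assumes a standard CDAP v3 REST API is reachable:

```shell
# Delete the orphaned Dataproc cluster.
# The cluster name appears at the start of the pipeline logs;
# CLUSTER_NAME and REGION are placeholders.
gcloud dataproc clusters delete CLUSTER_NAME --region=REGION --quiet

# Optionally, ask CDAP to stop the stuck workflow run via its REST API.
# CDAP_ENDPOINT, PIPELINE_NAME, and RUN_ID are placeholders.
curl -X POST \
  "${CDAP_ENDPOINT}/v3/namespaces/default/apps/PIPELINE_NAME/workflows/DataPipelineWorkflow/runs/RUN_ID/stop" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)"
```

Deleting the cluster stops the billing for the stuck compute resources even if the CDAP run record takes a while to settle.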

Jay Pandya