I have more than 50 Data Fusion pipelines running concurrently in an Enterprise instance of Data Fusion. About 4 of them fail at random on each concurrent run, and the logs show only the provisioning of the Dataproc cluster followed by its deprovisioning, as in this log:

2021-04-29 12:52:49,936 - INFO  [provisioning-service-4:i.c.c.r.s.p.d.DataprocProvisioner@203] - Creating Dataproc cluster cdap-fm-smartd-cc94285f-a8e9-11eb-9891-6ea1fb306892 in project project-test, in region europe-west2, with image 1.3, with system labels {goog-datafusion-version=6_1, cdap-version=6_1_4-1598048594947, goog-datafusion-edition=enterprise}
2021-04-29 12:56:08,527 - DEBUG [provisioning-service-1:i.c.c.i.p.t.ProvisioningTask@116] - Completed PROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
2021-04-29 13:04:01,678 - DEBUG [provisioning-service-7:i.c.c.i.p.t.ProvisioningTask@116] - Completed DEPROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.

When a failed pipeline is restarted, it completes successfully. All the pipelines are started and monitored via Composer, using an asynchronous start and a custom wait SensorOperator. There are no quota-exceeded warnings.

Additional info: Data Fusion 6.1.4 with an ephemeral Dataproc cluster with 1 master and 2 workers. Image version 1.3.89.

EDIT

The appfabric logs related to each failed pipeline are:

WARN  [program.status:i.c.c.i.a.r.d.DistributedProgramRuntimeService@172] - Twill RunId does not exist for the program program:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow, runId f34a6fb4-acb2-11eb-bbb2-26edc49aada0

WARN  [pool-11-thread-1:i.c.c.i.a.s.RunRecordCorrectorService@141] - Fixed RunRecord for program run program_run:default.[piepleine_name].-SNAPSHOT.workflow.DataPipelineWorkflow.fdc22f56-acb2-11eb-bbcf-26edc49aada0 in STARTING state because it is actually not running

Further research suggests the problem is somehow connected to an inconsistent state in the CDAP run records, which arises when many concurrent requests are made via the REST API.

glrsmm
  • Could you please provide appfabric logs around the same time there were random failures? – vinisha May 03 '21 at 18:38
  • just edited with the appfabric logs – glrsmm May 04 '21 at 13:02
  • Are there any logs in appfabric logs related to `cc94285f-a8e9-11eb-9891-6ea1fb306892` run? – vinisha May 05 '21 at 02:51
  • Since I added a waiting time of 10 seconds between each pipeline start API call, I haven't seen the error again. If the error appears again, I'll try to post the logs related to the failed pipelines and their runs. (I cannot recover the previous logs because the Data Fusion instance was deleted and recreated.) – glrsmm May 11 '21 at 14:47
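
The workaround described in the last comment (spacing out the start calls instead of firing them all at once) could look roughly like the sketch below. It is only an illustration, not the code used in the Composer DAG: the helper names `build_start_url`, `start_pipeline` and `start_staggered`, the `api_endpoint` and `token` parameters, and the assumption that every pipeline is a batch pipeline exposed as the `DataPipelineWorkflow` workflow in the `default` namespace are all hypothetical choices made for the example.

```python
import time
import urllib.request
from typing import Callable, Iterable, List


def build_start_url(api_endpoint: str, pipeline: str, namespace: str = "default") -> str:
    """Build the CDAP program-lifecycle URL that starts a batch pipeline.

    Assumes the pipeline runs as the DataPipelineWorkflow workflow,
    as in the program run IDs shown in the logs above.
    """
    return (
        f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
        f"/apps/{pipeline}/workflows/DataPipelineWorkflow/start"
    )


def start_pipeline(api_endpoint: str, pipeline: str, token: str) -> int:
    """POST the start request and return the HTTP status code.

    `token` is assumed to be a valid OAuth access token for the instance.
    """
    request = urllib.request.Request(
        build_start_url(api_endpoint, pipeline),
        data=b"",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def start_staggered(
    pipelines: Iterable[str],
    start_fn: Callable[[str], object],
    delay_seconds: float = 10.0,
    sleep_fn: Callable[[float], object] = time.sleep,
) -> List[str]:
    """Start pipelines one by one, pausing between consecutive start calls."""
    started: List[str] = []
    for index, name in enumerate(pipelines):
        if index > 0:
            # Wait before every start call except the first one.
            sleep_fn(delay_seconds)
        start_fn(name)
        started.append(name)
    return started
```

In a DAG this might be called as `start_staggered(names, lambda n: start_pipeline(endpoint, n, token))`, with the existing sensor still handling the wait for completion. Passing `start_fn` and `sleep_fn` as parameters keeps the pacing logic testable without touching a real instance.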
