I have more than 50 Data Fusion pipelines running concurrently in an Enterprise instance of Data Fusion. About 4 of them fail at random on each concurrent run. For the failed runs, the logs show only the provisioning of the Dataproc cluster followed by its deprovisioning, as in this log:
2021-04-29 12:52:49,936 - INFO [provisioning-service-4:i.c.c.r.s.p.d.DataprocProvisioner@203] - Creating Dataproc cluster cdap-fm-smartd-cc94285f-a8e9-11eb-9891-6ea1fb306892 in project project-test, in region europe-west2, with image 1.3, with system labels {goog-datafusion-version=6_1, cdap-version=6_1_4-1598048594947, goog-datafusion-edition=enterprise}
2021-04-29 12:56:08,527 - DEBUG [provisioning-service-1:i.c.c.i.p.t.ProvisioningTask@116] - Completed PROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
2021-04-29 13:04:01,678 - DEBUG [provisioning-service-7:i.c.c.i.p.t.ProvisioningTask@116] - Completed DEPROVISION task for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.cc94285f-a8e9-11eb-9891-6ea1fb306892.
When a failed pipeline is restarted, it completes successfully. All the pipelines are started and monitored via Composer, using an asynchronous start and a custom wait SensorOperator. There are no quota-exceeded warnings.
Additional info: Data Fusion 6.1.4 with an ephemeral Dataproc cluster with 1 master and 2 workers. Image version 1.3.89.
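Since a restarted run succeeds, the restart can be automated as a stopgap. Below is a minimal sketch of a restart-on-failure loop, not the actual Composer DAG. It assumes two hypothetical callables, `start` and `get_status`, that wrap whatever calls the Composer tasks already make to start a pipeline and read its run status. The status strings are assumed to be the usual CDAP run states (COMPLETED, FAILED, KILLED).

```python
import time
from typing import Callable

# Assumed CDAP run states that end a run.
TERMINAL_OK = {"COMPLETED"}
TERMINAL_BAD = {"FAILED", "KILLED"}


def run_with_restart(
    start: Callable[[], str],            # hypothetical: starts the pipeline, returns a run id
    get_status: Callable[[str], str],    # hypothetical: returns the status of a run id
    max_attempts: int = 2,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Start a pipeline and wait for it; restart it if the run fails.

    Returns the run id of the first run that reaches COMPLETED.
    Raises RuntimeError when every attempt ends in a failed state.
    """
    for attempt in range(1, max_attempts + 1):
        run_id = start()
        while True:
            status = get_status(run_id)
            if status in TERMINAL_OK:
                return run_id
            if status in TERMINAL_BAD:
                break  # give up on this run and start a new attempt
            sleep(poll_interval_s)
    raise RuntimeError(f"pipeline did not complete after {max_attempts} attempts")
```

This only hides the symptom: the failed run still provisions and deprovisions a Dataproc cluster before the retry starts.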
EDIT
The appfabric logs related to each failed pipeline are:
WARN [program.status:i.c.c.i.a.r.d.DistributedProgramRuntimeService@172] - Twill RunId does not exist for the program program:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow, runId f34a6fb4-acb2-11eb-bbb2-26edc49aada0
WARN [pool-11-thread-1:i.c.c.i.a.s.RunRecordCorrectorService@141] - Fixed RunRecord for program run program_run:default.[pipeline_name].-SNAPSHOT.workflow.DataPipelineWorkflow.fdc22f56-acb2-11eb-bbcf-26edc49aada0 in STARTING state because it is actually not running
Further research suggests that the problem is related to an inconsistent state in the CDAP run records, which arises when many concurrent requests are made via the REST API.
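If concurrent REST requests are indeed the trigger, one mitigation to try is limiting how many start requests are in flight at once. The following is a minimal sketch under that assumption, not a confirmed fix. `start_pipeline` is a hypothetical callable that issues the start request for one pipeline and returns its run id, and the `max_in_flight` and `min_interval_s` defaults are arbitrary placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def start_throttled(
    pipeline_names: Iterable[str],
    start_pipeline: Callable[[str], str],  # hypothetical: sends the start request, returns a run id
    max_in_flight: int = 5,                # arbitrary cap on concurrent start requests
    min_interval_s: float = 1.0,           # arbitrary pause between consecutive submissions
) -> Dict[str, str]:
    """Start pipelines with a cap on concurrent start requests.

    At most `max_in_flight` calls to `start_pipeline` run at the same time,
    and consecutive submissions are spaced by at least `min_interval_s`.
    Returns a mapping from pipeline name to the run id it was given.
    """
    names = list(pipeline_names)
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {}
        for i, name in enumerate(names):
            if i > 0:
                time.sleep(min_interval_s)
            futures[name] = pool.submit(start_pipeline, name)
        # result() re-raises any exception raised by start_pipeline.
        return {name: fut.result() for name, fut in futures.items()}
```

The cap applies only to the start requests themselves. The pipelines still run concurrently once started, so this isolates whether the burst of REST calls is what leaves run records stuck in STARTING.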