
I have a schedule which runs my flow twice a day - at 0910 and 1520 BST.

[Screenshot of the schedule]

This is spawning a massive number of Dataflow jobs - so far today just the second schedule (1520) has spawned 80 jobs:

$ gcloud dataflow jobs list
JOB_ID                                    NAME                             TYPE   CREATION_TIME        STATE      REGION
2018-07-29_12_17_06-14876588186269022154  project-name-513008-by-username  Batch  2018-07-29 19:17:07  Running    us-central1
2018-07-29_12_14_54-6436458673562317581   project-name-512986-by-username  Batch  2018-07-29 19:14:55  Cancelled  us-central1
2018-07-29_12_13_55-6167618802124600084   project-name-512985-by-username  Batch  2018-07-29 19:13:57  Cancelled  us-central1
...

(see PasteBin for the full list)
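For reference, this is roughly how I've been counting and cancelling the strays from the command line while I investigate. The grep patterns are just illustrative - they match the name prefix and creation hour from the listing above - and the job ID is taken from that same output:

$ # count jobs spawned by today's 15:20 run
$ gcloud dataflow jobs list --region=us-central1 | grep 'project-name-' | grep '2018-07-29 19:' | wc -l
$ # cancel one of the duplicates that is still running
$ gcloud dataflow jobs cancel 2018-07-29_12_17_06-14876588186269022154 --region=us-central1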

In the days after last week's Dataprep update, I had trouble accessing the run settings URL for the flow. I suspect the run settings page walks back through the flow (I have 12 flows chained by reference datasets) and sanity-checks it - my flow seems to have been just complex enough for that page load to time out, and I had to cut out a couple of steps just to reach the run settings.

I wonder if each time this timed out, it somehow duplicated the schedule or something else in the process - but then again, the number of duplicated jobs is inconsistent.

I recently rebuilt this project after running into sampling issues (the sample was corrupt, so I couldn't load the transformation UI, but also couldn't build a new sample). After a hefty attempt at resolving that, I took the chance to rebuild it as a dedicated GCP project with some structural improvements. I didn't see this scheduling problem before the rebuild.

Adam Hopkinson
  • To better help, it would be useful to have more information regarding the Dataprep flow: a screenshot of the flow and some details. – Nathan Nasser Jan 13 '19 at 02:13

0 Answers