
I am running a dataflow that I last ran a few months ago, from the same client and with the same Dataflow SDK version (0.7.0dev0). Unfortunately, it now fails in mysterious ways that it did not before.

I am starting the job, and the first stage is:

(8733429d016bc2fb): Executing operation read from datastore/Split Query+read from datastore/GroupByKey/Reify+read from datastore/GroupByKey/Write

But it gives the following error after 1 hour:

(e88cb3c076926976): Workflow failed. Causes: (e88cb3c07692626f): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.

If it would help, the job ID is 2017-08-21_00_30_03-3588685705436948852. I would upgrade to a newer version of the library, but that involves a bunch more API changes and figuring out how to get all the pieces working again, so I'm holding off on that for now. I was hoping that "a simple use case that previously worked and currently fails" might be easier to debug than changing even more things.

I'm not sure how to debug or investigate further. It worked a few months ago with the same code, but doesn't work now (with a 4-5x larger dataset, of 200-300K records, nothing crazy...)
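For anyone else unsure how to investigate a stuck job: one way to dig in (a sketch, assuming the `gcloud` CLI is installed and authenticated; exact flags may vary by CLI version) is to pull the job's status and its worker logs from Stackdriver, which often show the underlying exception before the "appears to be stuck" timeout fires:

```shell
# Inspect the stuck job's current state and stages (job ID from above).
gcloud dataflow jobs describe 2017-08-21_00_30_03-3588685705436948852

# Pull recent log entries emitted by the job's steps/workers.
gcloud logging read \
  'resource.type="dataflow_step" AND resource.labels.job_id="2017-08-21_00_30_03-3588685705436948852"' \
  --limit 50
```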

EMBarbosa
Mike Lambert
  • Could you share a job ID or any more details of your pipeline? Would it be possible to upgrade to a newer version? – Ben Chambers Aug 21 '17 at 22:02
  • Okay, things seem to work after upgrading to 2.0.0! (Required some import fixups, reworking how I download/import apache-beam, etc.) I assume there's just some bitrot on the gcloud servers not supporting the 0.7.0-dev version... – Mike Lambert Aug 22 '17 at 23:48
  • I am experiencing this exact issue. The job used to take 4-6 minutes, but now it doesn't end; rather, it just doesn't start: it shows a `partially running` state on `GroupByKey` and `running` on `UserQuery` and `SplitQuery`. I was using the 2.1.0 Python SDK and tried the 2.0.0 SDK, but the error still persists. How do I go about this? @BenChambers – Anuj Sep 28 '17 at 11:20
  • @BenChambers Also, the data I am working on hasn't changed in size. Since the job used to take 4-5 minutes, I stopped all the jobs which ran for more than 10 minutes; I'll try to check if it shows the `workflow-failed` error – Anuj Sep 28 '17 at 11:27
  • @BenChambers So I ran the job, and it failed after an hour with the `The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.` error – Anuj Sep 28 '17 at 13:15
  • 2
    Please start a new question -- since you were already on a newer SDK and likely have a different pipeline it is likely a different issue. Job IDs will also be necessary to dig in much further. – Ben Chambers Sep 28 '17 at 16:16
  • And...this has started happening for me again. A month ago it ran correctly on 2.0.0, but now seems broken again. Job ID 2017-10-11_11_55_31-16012662075482454553. Tomorrow I'll try upgrading to the latest (currently 2.1.1) and see if that helps. – Mike Lambert Oct 11 '17 at 20:09

1 Answer


This was fixed by upgrading to 2.0.0 (thanks, Ben Chambers!). It seems that the 0.7.0 SDK no longer works well with the Cloud Dataflow service.
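For reference, the upgrade itself mostly amounts to swapping packages. A rough sketch (assuming pip; the exact package name you had installed for 0.7.0dev0 may differ):

```shell
# Drop the old pre-release Dataflow SDK package and install the
# renamed 2.x release of the Apache Beam Python SDK with GCP extras.
pip uninstall -y google-cloud-dataflow
pip install 'apache-beam[gcp]==2.0.0'
```

Depending on which 0.x package the pipeline was originally written against, imports may also need updating (e.g. from the old `google.cloud.dataflow` module path to `apache_beam`), which matches the "import fixups" mentioned in the comments above.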

Mike Lambert