
I have a Dataflow job which I am trying to 'drain'. The explanation of the drain option says:

Dataflow will cease all data ingestion, but will attempt to finish processing any remaining buffered data. Pipeline resources will be maintained until buffered data has finished processing and any pending output has finished writing.

But data ingestion does not seem to stop: the `Elements added` count is still increasing, and the job hasn't stopped for over an hour now. Is this expected behavior? I am using a Pub/Sub source, if that helps.

EDIT: Here is the job ID - 2017-10-30_19_59_30-14251132252018661885
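
For reference, the pipeline is just a Pub/Sub read followed by a simple JSON-decoding step, roughly like the sketch below (the subscription path, transform names, and decode logic are placeholders, not the exact code):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class SimpleJsonPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromPubsub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
     .apply("DecodeJson", ParDo.of(new DoFn<String, String>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Decode the message payload; any exception thrown here is retried
         // indefinitely by the streaming runner.
         c.output(c.element()); // placeholder for the actual JSON decoding
       }
     }));

    p.run();
  }
}
```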

asked by Kakaji
  • That potentially sounds like a bug. Please include a Dataflow job ID so a Dataflow engineer can help debug this. – jkff Oct 31 '17 at 05:51
  • @jkff I have added the job ID to the question. :-) – Kakaji Oct 31 '17 at 06:41
  • Thanks. Is your job stuck in a loop of retrying some work that keeps failing? In that case drain won't work, you'll need to update the pipeline with non-failing code first, or cancel it. – jkff Oct 31 '17 at 19:28
  • @jkff I am deliberately failing my job to see if I can obtain the failed data again after I fix the job. It was suggested that drain is the way to go (https://stackoverflow.com/questions/46721532/at-what-stage-does-dataflow-apache-beam-ack-a-pub-sub-message). There is no loop inside the code; it's a simple JSON-decoding pipeline. As I mentioned, `Elements added` keeps increasing in the very first `PubsubIO.Read` step even after I press `drain`. That step does not contain any code I wrote; it's a simple `PubsubIO.readStrings().fromSubscription()` call. Thanks! – Kakaji Nov 01 '17 at 05:21
  • Does the Dataflow UI show any errors or exceptions in the logs? The retry loop is in Dataflow itself: the streaming runner treats all errors as transient and retries them forever to avoid discarding data. – jkff Nov 01 '17 at 05:48
  • Note that the linked suggestion of mine includes also a suggestion to use Update in case your pipeline is having failures. – jkff Nov 01 '17 at 05:50
  • @jkff As I said I am deliberately throwing a `RuntimeException` to see how Dataflow handles exceptions. Is this expected behavior of Dataflow? Will it never drain? I have a feeling that dataflow is fetching the failed messages again and again from Pub/Sub. Am I misunderstanding the part that says "cease all data ingestion" in the documentation? – Kakaji Nov 02 '17 at 00:47
  • Yes, a failing pipeline cannot be drained. Draining requires successfully completing current processing, and a message that perpetually throws an exception when processed prevents that. – jkff Nov 02 '17 at 02:05
  • @jkff So the proper way to handle a failing job is to keep it running until I fix my code and then use `Update` to update the failing job? Would that prevent data loss? – Kakaji Nov 02 '17 at 02:48
  • Yes, this is correct, and there will be no data loss. – jkff Nov 02 '17 at 04:52
  • @jkff Thank you once again. I will test this out and add this as an answer if everything goes well. :-) – Kakaji Nov 02 '17 at 06:39

1 Answer


As mentioned in the comments by @jkff, a failing job cannot be drained. The correct way to handle a failing Dataflow job is to fix the code and then update the running job using the `--update` option. Because the update carries over the job's in-flight state, this prevents any data loss.
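
For illustration, here is a minimal sketch of submitting the fixed pipeline as an update, assuming the Beam Java SDK and the Dataflow runner; the job name below is a placeholder and must match the name of the running (failing) job, and `setUpdate(true)` is the programmatic equivalent of passing `--update` on the command line:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UpdateFailingJob {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setUpdate(true);                // same effect as passing --update
    options.setJobName("my-streaming-job"); // must match the name of the running job

    Pipeline p = Pipeline.create(options);
    // Rebuild the same pipeline here, with the fixed (non-throwing) DoFn.
    p.run();
  }
}
```

Dataflow accepts the update only if the new pipeline is compatible with the running one (e.g. the transform names still line up), and it transfers the buffered in-flight data to the new code, which is why nothing is lost.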

answered by Kakaji