
I have a Dataflow streaming job with a Pub/Sub subscription as an unbounded source. I want to know at what stage Dataflow acks the incoming Pub/Sub message. It appears to me that the message is lost if an exception is thrown during any stage of the pipeline.

I'd also like to know the best practices for writing a Dataflow pipeline with a Pub/Sub unbounded source so that messages can be recovered on failure. Thank you!

Kakaji

1 Answer


The Dataflow streaming runner acks Pub/Sub messages received by a bundle after the bundle has succeeded and its results (outputs, state mutations, etc.) have been durably committed. Failed bundles are retried until they succeed and don't cause data loss. If you believe that data loss may be happening, please include details (the job ID and the reasoning that led you to conclude that data has been dropped because of the failures) and we'll investigate.
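
To address the best-practices part of the question: since failed bundles are retried indefinitely, a permanently bad message can stall a pipeline. A common pattern is to catch exceptions inside the DoFn and route offending elements to a dead-letter sink instead of re-raising. Below is a minimal sketch with the Beam Python SDK; the subscription/topic paths and the `ProcessMessage` body are placeholder assumptions, not something from this thread.

```python
import json
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ProcessMessage(beam.DoFn):
    """Placeholder processing step: parse JSON, route failures to a side output."""

    DEAD_LETTER = 'dead_letter'

    def process(self, message):
        try:
            # Stand-in for real business logic; `message` is the raw payload bytes.
            yield json.loads(message.decode('utf-8'))
        except Exception as e:
            logging.error('Failed to process message: %s', e)
            # Emit to a tagged output instead of re-raising, so the bundle
            # still succeeds and the Pub/Sub message gets acked.
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER, message)


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        results = (
            p
            | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
                subscription='projects/my-project/subscriptions/my-sub')
            | 'Process' >> beam.ParDo(ProcessMessage()).with_outputs(
                ProcessMessage.DEAD_LETTER, main='ok'))

        # Bad payloads are preserved on a dead-letter topic for inspection/replay.
        _ = (results[ProcessMessage.DEAD_LETTER]
             | 'WriteDeadLetter' >> beam.io.WriteToPubSub(
                 topic='projects/my-project/topics/dead-letter'))

        # Downstream processing of successfully parsed messages.
        _ = results.ok | 'LogOk' >> beam.Map(lambda parsed: logging.info('%s', parsed))


if __name__ == '__main__':
    run()
```

With this shape the bundle succeeds even for bad input, so the Pub/Sub message is acked, and the raw payload is kept on the dead-letter topic rather than being retried forever or lost.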

jkff
  • Here's the job ID: `2017-10-12_19_26_32-4234684930060241078`. You can see in the console that there is a stage that failed and hence displays nothing in its 'Output collections' section. I wasn't able to receive the lost data again via a new Dataflow job (after cancelling this one). I couldn't receive the data using the Pub/Sub client library either. – Kakaji Oct 14 '17 at 23:19
  • 1
    Hmm, yes, if you cancel the pipeline then all intermediate data in the pipeline is lost. When dataflow ingests data into the pipeline, it durably stores it and protects against data loss in case of transient errors, but pipeline cancellation is another matter. I suppose you'd like messages to be acked when they've been "fully processed" by the entire pipeline, but this concept is nearly impossible to define in a general way. Basically, in case of failures, if you want to preserve the data, either use Update feature update the pipeline with non-failing code, or use Drain to cancel gracefully. – jkff Oct 15 '17 at 00:31
  • 1
    I'm making a pipeline like "read_from_pubsub->process_message->send_outside". In "send_outside", if I got an exception like 50x error from the end point, I record the error to log and raise the exception again for Dataflow to catch. I've having a problem that the all steps stopped working after the exception is raised. How can I properly return the send_outside function?(other than raising the exception?) – sees Dec 21 '20 at 06:16
  • @sees did you find a solution for this? I have a similar case. I wonder how the message is acked in this case, because it's not "durably committed" as explained above. – Ashika Umanga Umagiliya Feb 02 '22 at 01:05
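
For later readers, the Drain and Update options mentioned in the comments can be invoked roughly as follows. The job ID is the one quoted above; the region flag and the pipeline flags are assumptions that depend on how the job was launched.

```
# Drain: stop pulling new Pub/Sub messages, finish processing in-flight data, then stop.
gcloud dataflow jobs drain 2017-10-12_19_26_32-4234684930060241078 --region=us-central1

# Update: launch the fixed pipeline as an in-place replacement for the running job;
# the job name must match the running job (Beam pipeline options, Python SDK):
python my_pipeline.py --runner=DataflowRunner --update --job_name=<running-job-name> ...
```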