
I just read this article

https://medium.com/bb-tutorials-and-thoughts/how-to-create-a-streaming-job-on-gcp-dataflow-a71b9a28e432

What I am truly missing here, though: if I drop 50 files and this is a streaming job like the article says (always live), won't the output be a windowed join of all the files?

If not, what would it look like, and how would it change to be a windowed join? I am trying to get a picture in my head of both worlds:

  • A windowed join in a streaming job (outputting 1 file for all the input files)
  • A non-windowed join in a streaming job (outputting 1 file PER input file)

Can anyone shed light on that article and what would change?

I also read something about 'Bounded PCollections'. In that case, perhaps windowing is not needed, since inside the stream it behaves sort of like a batch: until the entire PCollection has been processed, we do not move to the next stage? Perhaps if the article is using a bounded PCollection, then all input files map 1 to 1 with output files?

How can one tell from inside a function whether I am receiving data from a bounded or unbounded collection? Is there some other way I can tell that? Are bounded collections even possible in an Apache Beam streaming job?

Dean Hiller

1 Answer


I'll try to answer some of your questions.

What I am truly missing here, though: if I drop 50 files and this is a streaming job like the article says (always live), won't the output be a windowed join of all the files?

Input (source) and output (sink) are not directly linked, so this depends on what you do in your pipeline. TextIO.watchForNewFiles is a streaming source transform that keeps observing a given file location, keeps reading new files, and outputs the lines read from those files. Hence the output from this step will be a PCollection<String> that streams the lines of text read from those files.

Windowing is set next; this decides how your data will be bundled into windows. For this pipeline, they chose FixedWindows of 1 minute. The timestamp will be the time the file was observed.

The sink transform is applied at the end of your pipeline (sometimes sinks also produce outputs, so it might not really be the end). In this case they chose TextIO.write(), which writes the lines of an input PCollection<String> to output text files.

So whether the output will include data from all input files depends on how your input files are processed and how their lines are bundled into windows within the pipeline.
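To make the bundling concrete, here is a minimal plain-Java sketch of how fixed windows group timestamped lines. It is only an illustration and does not use the Beam API: the class name `FixedWindowSketch`, the file names, and the timestamps are all hypothetical, and it assumes windows aligned to the epoch with no offset.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java, not the Beam API. All names are hypothetical.
class FixedWindowSketch {

    // Start of the fixed window that contains `timestamp`,
    // assuming windows are aligned to the epoch (no offset).
    static Instant windowStart(Instant timestamp, Duration size) {
        long sizeMs = size.toMillis();
        long ts = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    // Groups (timestamp, line) pairs by the start of their window.
    static Map<Instant, List<String>> groupByWindow(
            List<Map.Entry<Instant, String>> elements, Duration size) {
        Map<Instant, List<String>> windows = new TreeMap<>();
        for (Map.Entry<Instant, String> e : elements) {
            windows.computeIfAbsent(windowStart(e.getKey(), size), k -> new ArrayList<>())
                   .add(e.getValue());
        }
        return windows;
    }

    public static void main(String[] args) {
        // Hypothetical lines, each stamped with the time its file was observed.
        List<Map.Entry<Instant, String>> lines = List.of(
            Map.entry(Instant.parse("2024-01-01T00:00:05Z"), "fileA: line 1"),
            Map.entry(Instant.parse("2024-01-01T00:00:40Z"), "fileB: line 1"),
            Map.entry(Instant.parse("2024-01-01T00:01:10Z"), "fileC: line 1"));

        // Lines from fileA and fileB share a window; fileC falls in the next one.
        groupByWindow(lines, Duration.ofMinutes(1))
            .forEach((start, group) -> System.out.println(start + " -> " + group));
    }
}
```

In this sketch, lines from different files observed within the same minute share a window, so a per-window sink would write them together, while a file observed a minute later ends up in a separate window.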

I also read something about 'Bounded PCollections'. In that case, perhaps windowing is not needed, since inside the stream it behaves sort of like a batch: until the entire PCollection has been processed, we do not move to the next stage? Perhaps if the article is using a bounded PCollection, then all input files map 1 to 1 with output files?

You can use bounded inputs in a streaming pipeline. In a streaming pipeline, progress is tracked through a watermark. If you use a bounded input (for example, a bounded source), the watermark will just jump from 0 to infinity instead of progressing gradually. Hence your pipeline might simply end instead of waiting for more data.

How can one tell from inside a function whether I am receiving data from a bounded or unbounded collection? Is there some other way I can tell that? Are bounded collections even possible in an Apache Beam streaming job?

It is definitely possible, as I mentioned above. If you have access to the input PCollection, you can call its isBounded function to determine whether it is bounded. See here for an example. You have access to input PCollections when expanding PTransforms (that is, during job submission). I don't believe you have access to this at runtime.

chamikara
  • If I am using a fixed window (of 2 minutes) to aggregate incoming data from a streaming source (let's say I am reading Pub/Sub messages) and then writing it to BigQuery, is the PCollection that comes out of the fixed window and goes into BigQuery considered `BOUNDED`? – Daniel Mar 30 '22 at 09:47