I'll try to answer some of your questions.
What I am truly missing here though is if I drop 50 files and this is
a streaming job like the article says(always live), then won't the
output be a windowed join of all the files?
Input (source) and output (sink) are not directly linked, so this depends on what you do in your pipeline. TextIO.watchForNewFiles is a streaming source transform that keeps observing a given file location, reads new files as they appear, and outputs the lines read from them. Hence the output of this step is a PCollection<String> that streams the lines of text read from those files.
Windowing is set next; this decides how your data is grouped into windows. For this pipeline, they chose FixedWindows of 1 minute. The timestamp of each element will be the time the file was observed.
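To make the fixed-window idea concrete, here is a minimal sketch in plain Java (not the Beam API) of how a timestamp could be mapped to a 1-minute window. The class and method names are made up for illustration, and it assumes windows aligned to the epoch with no offset.

```java
// Illustrative only: plain Java arithmetic, not Beam's FixedWindows implementation.
final class FixedWindowSketch {
    // Assumed window size: 1 minute, aligned to the epoch (no offset).
    static final long WINDOW_SIZE_MILLIS = 60_000L;

    // Start (inclusive) of the fixed window that contains the timestamp.
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_SIZE_MILLIS);
    }

    // End (exclusive) of the fixed window that contains the timestamp.
    static long windowEnd(long timestampMillis) {
        return windowStart(timestampMillis) + WINDOW_SIZE_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(windowStart(125_000L)); // 120000
        System.out.println(windowStart(179_999L)); // 120000 (same window as above)
        System.out.println(windowStart(180_000L)); // 180000 (next window)
    }
}
```

Two files observed within the same minute get timestamps that fall into the same window, while a file observed a minute later falls into the next one.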
A sink transform is applied at the end of your pipeline (sinks sometimes produce outputs too, so it might not really be the end). In this case they chose TextIO.write(), which writes the lines of an input PCollection<String> to output text files.
So whether the output will include data from all input files or not depends on how your input files are processed and how their lines are grouped into windows within the pipeline.
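As a rough illustration of why input files do not map 1 to 1 to outputs, the sketch below groups lines by the window of their file's observation time. It is plain Java rather than Beam code; the TimestampedLine type, the file names, and the timestamps are hypothetical, and a real sink may further split each window into several shards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: mimics window grouping, does not use the Beam SDK.
final class WindowedGroupingSketch {
    // Hypothetical holder for one line and the time its file was observed.
    static final class TimestampedLine {
        final String file;
        final long observedAtMillis;
        final String line;

        TimestampedLine(String file, long observedAtMillis, String line) {
            this.file = file;
            this.observedAtMillis = observedAtMillis;
            this.line = line;
        }
    }

    // Groups lines by the start of the fixed window their timestamp falls into.
    static Map<Long, List<String>> groupByWindow(List<TimestampedLine> lines, long windowSizeMillis) {
        Map<Long, List<String>> byWindow = new TreeMap<>();
        for (TimestampedLine l : lines) {
            long start = l.observedAtMillis - Math.floorMod(l.observedAtMillis, windowSizeMillis);
            byWindow.computeIfAbsent(start, k -> new ArrayList<>()).add(l.line);
        }
        return byWindow;
    }

    public static void main(String[] args) {
        List<TimestampedLine> input = List.of(
            new TimestampedLine("a.txt", 1_000L, "a1"),
            new TimestampedLine("b.txt", 30_000L, "b1"),
            new TimestampedLine("c.txt", 61_000L, "c1"));
        // Three files, but only two windows: a.txt and b.txt share the first one.
        System.out.println(groupByWindow(input, 60_000L));
    }
}
```

Here three input files yield two groups, because grouping follows the windows and not the file boundaries.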
I also read something about 'Bounded PCollections'. In that case,
perhaps windowing is not needed as inside the stream it is sort of
like a batch of until we have the entire Pcollection processed, we do
not move to the next stage? Perhaps if the article is using bounded
PCollection, then all input files map 1 to 1 with output files?
You can use bounded inputs in a streaming pipeline. In a streaming pipeline, progress is tracked through a watermark. If you use a bounded input (for example, a bounded source), the watermark will just jump from 0 to infinity instead of progressing gradually. Hence your pipeline might just end instead of waiting for more data.
How can one tell from inside a function if I am receiving data from a
bounded or unbounded collection? Is there some other way I can tell
that? Is bounded collections even possible in apache beam streaming
job?
It is definitely possible, as I mentioned above. If you have access to the input PCollection, you can use the isBounded function to determine whether it is bounded. See here for an example. You have access to input PCollections when expanding PTransforms (hence during job submission). I don't believe you have access to this at runtime.