I am trying to understand how Apache Beam works, and I'm not quite sure that I do. So I want someone to tell me if my understanding is right:
- Beam is a layer of abstraction over big data frameworks like Spark, Hadoop, Google Dataflow, etc. It doesn't cover quite every functionality of those frameworks, but almost.
- Beam treats data in two forms: bounded and unbounded. Bounded is something like a .csv file, unbounded is something like a Kafka subscription. There are different I/O read methods for each. For unbounded data we need to implement windowing (attaching a timestamp to each data point) and a trigger (a timestamp). A batch here would be all the data points in a window until a trigger is hit (see the second sketch after this list for what I mean). For bounded datasets, however, the whole dataset is loaded into RAM (is that right? If yes, how do I make Beam work on batches?). The output of an I/O method is a PCollection.
- There are PTransforms (these are the operations I want to run on the data) that apply to each element of the PCollection. I can make these PTransforms run on a Spark or Flink cluster (this choice goes in the initial options set for the pipeline). Each PTransform emits a PCollection, and that is how we chain various PTransforms together (see the first sketch after this list). The end result is a PCollection that can be saved to disk.
- The end of the pipeline could be a save to some file system (how does this happen when I am reading a .csv in batches?).
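
To make the bounded case concrete, here is a minimal sketch of the kind of pipeline I have in mind (assuming the Python SDK and the DirectRunner; the file names and the parsing logic are made up):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The runner choice (DirectRunner, SparkRunner, FlinkRunner, DataflowRunner, ...)
# goes into the pipeline options, not into the transforms themselves.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.csv")        # bounded source -> PCollection of lines
        | "Parse" >> beam.Map(lambda line: line.split(","))  # PTransform -> new PCollection
        | "Filter" >> beam.Filter(lambda fields: len(fields) > 1)
        | "Write" >> beam.io.WriteToText("output")            # sink at the end of the chain
    )
```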
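
And this is roughly what I mean by windowing and triggers on an unbounded source (again just a sketch of my understanding, assuming the Python SDK; the Kafka bootstrap server and topic name are made up, and ReadFromKafka is a cross-language transform that needs a Java expansion service behind it):

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # unbounded / streaming mode

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "localhost:9092"},
            topics=["events"],
        )
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                   # 60-second event-time windows
            trigger=trigger.AfterWatermark(),          # fire when the watermark passes the window end
            accumulation_mode=trigger.AccumulationMode.DISCARDING,
        )
        # one result per window -- this is what I picture as a "batch"
        | "CountPerWindow" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )
```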
Please point out any lapses in my understanding.