
I have 100k records to process. To reduce the overhead of fetching everything at once, I want to fetch 10k at a time (what I call my batch size), process that batch, then fetch the next 10k, until all 100k records are processed.

Any suggestions on how to achieve this using Apache Beam?

I am using the Spark runner.
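Not part of the original question — a minimal sketch of the batching idea being asked about. Beam ships transforms for this (`GroupIntoBatches` for keyed PCollections, and `BatchElements` in the Python SDK); the underlying chunking logic, independent of Beam, looks like:

```python
from itertools import islice

def chunked(records, batch_size=10_000):
    """Yield successive lists of at most batch_size records.

    A plain-Python stand-in for what Beam's GroupIntoBatches /
    BatchElements transforms do inside a pipeline.
    """
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: 100k records -> ten batches of 10k each.
batches = list(chunked(range(100_000), 10_000))
```

In an actual Beam pipeline, the equivalent would be applying `beam.GroupIntoBatches(10_000)` to a keyed PCollection (or `beam.BatchElements(max_batch_size=10_000)` in the Python SDK) so each downstream `DoFn` call receives one batch instead of one element.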

  • Did you ever figure this out? I'm trying to do this with Beam as well. My sinks crash with an OutOfMemoryException because they try to load everything at once. – Rahul Vaidya May 10 '19 at 01:00
  • I did it by running the pipeline multiple times based on the batch size; because of Beam's parallel processing, I wasn't able to achieve what I wanted directly, so I implemented it another way. Your out-of-memory error is probably because the same iteration might be picked up by all the threads at once, each processing the same data based on your logic. I faced the same problem initially. – Poornima Jasti May 13 '19 at 12:02

0 Answers