
Intention: I have a bounded data set (exposed via a REST API, 10k records) to be written to BigQuery with some additional processing steps. As my data set is bounded, I've implemented the BoundedSource interface to read records in my Apache Beam pipeline.
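
A stripped-down sketch of what I mean (the class name and the `fetch_record` helper are placeholders, not my actual code, which calls the REST API):

```python
from apache_beam.io import iobase
from apache_beam.io.range_trackers import OffsetRangeTracker


def fetch_record(offset):
    """Placeholder for the actual REST API call returning one record."""
    return {'id': offset}


class RestApiSource(iobase.BoundedSource):
    def __init__(self, total_records=10000):
        self._total = total_records

    def estimate_size(self):
        return self._total

    def get_range_tracker(self, start_position, stop_position):
        start = start_position if start_position is not None else 0
        stop = stop_position if stop_position is not None else self._total
        return OffsetRangeTracker(start, stop)

    def read(self, range_tracker):
        for offset in range(range_tracker.start_position(),
                            range_tracker.stop_position()):
            if not range_tracker.try_claim(offset):
                return
            yield fetch_record(offset)

    def split(self, desired_bundle_size, start_position=None, stop_position=None):
        # A single bundle, which is why everything ends up being read in one shot.
        yield iobase.SourceBundle(weight=self._total, source=self,
                                  start_position=0, stop_position=self._total)
```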

Problem: all 10k records are read in one shot (and written to BigQuery in one shot as well). But I want to query a small part (for example 200 rows), process it, save it to BigQuery, and then query the next 200 rows.

I've found that I can use windowing with bounded PCollections, but windows are created on a time basis (every 10 seconds, for example), and I want them to be created on a record-count basis (every 200 records).

Question: How can I implement the described splitting into batches/windows of 200 records? Am I missing something?

The question is similar to this one, but it wasn't answered.

vulpes

1 Answer

Given a PCollection of rows, you can use GroupIntoBatches to batch these up into a PCollection of sets of rows of a given size.
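
A minimal sketch (Python SDK): GroupIntoBatches operates on key/value pairs, so a key is attached first; the ten-way key spread, the `id` field, and the batch size of 200 are illustrative:

```python
import apache_beam as beam

# 'rows' is the PCollection of records read from the custom source.
batches = (
    rows
    | 'KeyRows' >> beam.Map(lambda row: (hash(row['id']) % 10, row))  # spread rows over 10 keys
    | 'Batch' >> beam.GroupIntoBatches(200)  # (key, [up to 200 rows]) pairs
    | 'DropKeys' >> beam.Values()            # PCollection of lists of rows
)
```

Each element of `batches` is then a list of up to 200 rows that can be processed and written to BigQuery as a unit.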

As for reading your input in an incremental way, you can use the split method of BoundedSource to shard your read into several pieces which will then be read independently (possibly on separate workers). For a bounded pipeline, this will still happen in its entirety (all 10k records read) before anything is written, but need not happen on a single worker. You could also insert a Reshuffle to decouple the parallelism between your read and your write.
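
As an illustration, assuming a source shaped like the sketch in the question, `split` could shard the offset range into bundles, and a `Reshuffle` can sit between the read and the write (the BigQuery table name is a placeholder):

```python
import apache_beam as beam
from apache_beam.io import iobase


# Illustrative split() for the question's RestApiSource sketch: shard the
# offset range so the resulting bundles can be read independently.
def split(self, desired_bundle_size, start_position=None, stop_position=None):
    start = start_position if start_position is not None else 0
    stop = stop_position if stop_position is not None else self._total
    while start < stop:
        end = min(start + desired_bundle_size, stop)
        yield iobase.SourceBundle(weight=end - start, source=self,
                                  start_position=start, stop_position=end)
        start = end


# Reshuffle between the read and the write to decouple their parallelism.
with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | 'Read' >> beam.io.Read(RestApiSource())
        | 'Reshuffle' >> beam.Reshuffle()
        | 'Write' >> beam.io.WriteToBigQuery('my-project:my_dataset.my_table'))
```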

robertwb
  • Thanks for your response. >> "For a bounded pipeline, this will still happen in its entirety (all 10k records read)" -> then it doesn't suit my case and I need to implement something different. (For me there is no point in splitting into batches with GroupIntoBatches if the dataset is read in one shot.) Do you have any clues on how to implement a streaming pipeline so that all the data isn't read in one shot? – vulpes Oct 11 '21 at 12:59
  • You can introduce a Reshuffle to decouple the batching of the read and write operations. If your BoundedSource implements split, it can split the reading up into chunks as well. Either way you can run the pipeline in streaming mode. – robertwb Oct 13 '21 at 06:52
  • Can you please advise how to run the job in streaming mode if I've implemented a BoundedSource to start the pipeline? I've tried to use enable_streaming_engine to run on GCP but it doesn't work. Should I add something to my code? Any examples would be appreciated. – vulpes Oct 13 '21 at 13:43
  • Try passing `--streaming`. – robertwb Oct 15 '21 at 06:45
  • `--streaming` doesn't work. I've found this in the [documentation](https://cloud.google.com/dataflow/docs/resources/faq): `Batch sources are not yet supported in streaming mode.` Are you sure it is possible, or is it just an assumption? Please provide an example pipeline if you have one. – vulpes Oct 16 '21 at 13:16