Use of GroupIntoBatches on Bounded Source

Question

I have a pipeline that translates a bounded data source into a set of RPCs to a third-party system, and want to have a reasonable balance between batching requests for efficiency and enforcing a maximum batch size. Is GroupIntoBatches the appropriate transform to use in this case? Are there any concerns around efficiency in batch mode that I should be aware of?

Based on the unit tests, it appears that the "final" batch will be emitted for a bounded source (even if it doesn't make up a full batch), correct?

score 1 · Accepted Answer · answered Aug 26 '20 at 14:57

I think that GroupIntoBatches is a good approach for this use case. Keep in mind that this transform operates on KV pairs, so the parallelism you can achieve is limited by the number of keys. I suggest taking a look at this answer.

Regarding the batch size: yes, a batch may be smaller than the configured size if there are not enough elements left to fill it. The Beam Python documentation has an example of this.
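To make the final-batch behavior concrete, here is a minimal plain-Python model of per-key batching with a size cap. It is not Beam code: the helper `group_into_batches` and the sample data are made up for illustration, and a real runner does not guarantee the output order shown here. It only sketches the rule that a key's leftover elements are flushed as an under-sized batch once the bounded input ends.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def group_into_batches(
    pairs: Iterable[Tuple[K, V]], batch_size: int
) -> Iterator[Tuple[K, List[V]]]:
    """Hypothetical stand-in that models per-key batching (not Beam itself)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    buffers: Dict[K, List[V]] = defaultdict(list)
    for key, value in pairs:
        buffers[key].append(value)
        if len(buffers[key]) == batch_size:
            # A batch is emitted as soon as it reaches the size cap.
            yield key, buffers.pop(key)
    # End of the bounded input: flush the leftovers, even if under-sized.
    for key, leftover in buffers.items():
        if leftover:
            yield key, leftover


# Illustrative keyed input: four "spring" elements and two "summer" elements.
produce = [
    ("spring", "strawberry"), ("spring", "carrot"), ("spring", "eggplant"),
    ("spring", "tomato"), ("summer", "carrot"), ("summer", "corn"),
]
batches = list(group_into_batches(produce, batch_size=3))
for key, batch in batches:
    print(key, batch)
```

With a batch size of 3, "spring" yields one full batch plus a final batch of one element, and "summer" yields a single batch of two. No batch ever exceeds the cap, and no element is dropped.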

score 1 · Answer 2 · answered Aug 28 '20 at 17:18

GroupIntoBatches will work. If you're running a batch pipeline and don't have a natural key on which to group, consider using BatchElements instead, which batches without keys and can be configured with either a fixed or a dynamic batch size. Making up a random key will often result in batches that are too small or parallelism that is too low, and it can interact poorly with liquid sharding.
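The trade-off with made-up keys can be sketched with some simple arithmetic in plain Python. This is not Beam code. Both helpers are hypothetical. Round-robin key assignment is used as a deterministic stand-in for random keys, and the keyless chunker assumes the whole input is seen as a single bundle.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_without_keys(elements: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Hypothetical keyless chunker: lists of at most max_batch_size elements."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    iterator = iter(elements)
    while True:
        batch = list(islice(iterator, max_batch_size))
        if not batch:
            return
        yield batch


def batch_sizes_with_synthetic_keys(
    num_elements: int, num_keys: int, max_batch_size: int
) -> List[int]:
    """Batch sizes when elements are spread round-robin over num_keys keys
    and then batched per key with a size cap."""
    sizes: List[int] = []
    for key in range(num_keys):
        # Number of element indices i with i % num_keys == key.
        per_key = len(range(key, num_elements, num_keys))
        full, rest = divmod(per_key, max_batch_size)
        sizes.extend([max_batch_size] * full)
        if rest:
            sizes.append(rest)
    return sizes


requests = list(range(10))  # ten illustrative RPC payloads
keyless = [len(b) for b in batch_without_keys(requests, 4)]
many_keys = batch_sizes_with_synthetic_keys(10, num_keys=5, max_batch_size=4)
one_key = batch_sizes_with_synthetic_keys(10, num_keys=1, max_batch_size=4)

print(keyless)    # [4, 4, 2]
print(many_keys)  # [2, 2, 2, 2, 2]
print(one_key)    # [4, 4, 2]
```

With five synthetic keys, every batch is half empty. With a single key the batches fill up, but all of the work hangs off one key, which is the limited-parallelism case the answer warns about. Batching without keys avoids having to pick a key count at all.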

robertwb

Rats, no Java support yet! – Sam McVeety Aug 28 '20 at 19:16
