I'm trying to convert the first value into the second:

PCollection<KV<Integer, List<T>>>
PCollection<KV<String, List<List<T>>>>

The input values are already grouped by key. Each output value should be a batch of the input sequences whose total length (the sum of the inner lists' sizes) is at most some limit. Each output key would be a new synthetic key, for example "<firstInputKeyInBatch>-<totalBatchSize>".
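
For example (illustrative values only, with a size limit of 5), the batching I'm after would look like this:

// input, already grouped by key
KV(1, [t1, t2]), KV(2, [t3]), KV(3, [t4, t5, t6])

// desired output: sequences are packed whole into batches of total size <= 5
KV("1-3", [[t1, t2], [t3]])   // 2 + 1 = 3; adding the next sequence would overflow
KV("3-3", [[t4, t5, t6]])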

The main goal is to save variable-length sequences to files of roughly the same size, while not splitting a sequence between different files.

Note that the built-in GroupIntoBatches operates on ungrouped PCollections, and it effectively splits the values under each key, whereas I need to combine them across keys. Its documentation says "Batches will contain only elements of a single key", so how can batches be composed of multiple keys? (A usage sketch follows.)
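
For reference, this is roughly how GroupIntoBatches is applied (a sketch; T is the element type):

// Per-key batching only: every emitted Iterable<T> holds elements of one key.
PCollection<KV<Integer, T>> ungrouped = ...;
PCollection<KV<Integer, Iterable<T>>> perKeyBatches =
    ungrouped.apply(GroupIntoBatches.<Integer, T>ofSize(10));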

I tried the approaches from other answers [1], [2], but the problem I'm facing is that the @FinishBundle method is called for each original Integer key, so my batches end up effectively the same as the input, e.g. all output keys end with "-1".

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// Class name is mine; buffers whole sequences and emits them at bundle end.
class BatchAcrossKeysFn<T> extends DoFn<KV<Integer, List<T>>, KV<String, List<List<T>>>> {
    private List<List<T>> batch = new ArrayList<>();
    private Integer lastKey = null;
    private int numKeys = 0;

    @ProcessElement
    public void processElement(ProcessContext context) {
        KV<Integer, List<T>> input = context.element();
        batch.add(input.getValue());
        lastKey = input.getKey();
        numKeys++;
        if (batch.size() > 10) {
            // ... flush the batch ...
        }
    }

    private String generateBatchId() {
        return lastKey + "-" + numKeys;
    }

    @FinishBundle
    public void finishBundle(FinishBundleContext context) {
        Instant timestamp = GlobalWindow.INSTANCE.maxTimestamp();
        context.output(KV.of(generateBatchId(), batch), timestamp, GlobalWindow.INSTANCE);
        batch = new ArrayList<>();
    }
}
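
For context, I apply the DoFn like this (a sketch; "grouped" is the PCollection produced by the earlier GroupByKey, and MyRecord stands in for T):

PCollection<KV<String, List<List<MyRecord>>>> output =
    grouped.apply(ParDo.of(new BatchAcrossKeysFn<MyRecord>()));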

[1] Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?

[2] Partition data coming from CSV so I can process larger patches rather then individual lines

I also tried using stateful processing, like in the GroupIntoBatches implementation, but the state variables are likewise scoped to a single input key. Also, that implementation doesn't handle bounded (batch) pipelines: the buffered batch never gets output. A sketch of the direction I tried follows.
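
Roughly, this is the direction I experimented with (a simplified sketch, not working code: the shard re-keying, ShardedBatchFn, and MAX_TOTAL_SIZE are my own inventions, and leftover buffered elements would still need a timer to get flushed):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

// Idea: first re-key each element onto one of a few synthetic shard keys,
// e.g. KV(originalKey % 16, KV(originalKey, values)), so that one state cell
// spans many original keys. Then buffer sequences in BagState and flush once
// the accumulated total size crosses a threshold.
class ShardedBatchFn<T> extends DoFn<KV<Integer, KV<Integer, List<T>>>, KV<String, List<List<T>>>> {
    private static final int MAX_TOTAL_SIZE = 1000;

    @StateId("buffer")
    private final StateSpec<BagState<List<T>>> bufferSpec = StateSpecs.bag();

    @StateId("totalSize")
    private final StateSpec<CombiningState<Integer, int[], Integer>> totalSizeSpec =
        StateSpecs.combining(Sum.ofIntegers());

    @ProcessElement
    public void processElement(
            ProcessContext context,
            @StateId("buffer") BagState<List<T>> buffer,
            @StateId("totalSize") CombiningState<Integer, int[], Integer> totalSize) {
        List<T> sequence = context.element().getValue().getValue();
        buffer.add(sequence);
        totalSize.add(sequence.size());
        if (totalSize.read() >= MAX_TOTAL_SIZE) {
            List<List<T>> batch = new ArrayList<>();
            buffer.read().forEach(batch::add);
            // Batch id simplified; tracking <firstInputKeyInBatch> would need
            // another ValueState.
            context.output(KV.of(context.element().getKey() + "-" + totalSize.read(), batch));
            buffer.clear();
            totalSize.clear();
        }
    }
}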

Dzmitry Lazerka
  • How many input items are you working with? If it's a very small amount, it might be that there's only enough for one item per batch; this is also runner-dependent. I got something working with code along the lines of what you have here, which started doing any batching (i.e. keys ending with -2 or more) once the input collection had more than ~30 elements with the DirectRunner. – Ryan M Jul 20 '18 at 00:52
  • @RyanM The input values can range from 1 to 1 million (somewhat exponentially distributed, with a long tail of small sequences). The number of keys in the input PCollection is around 100k. I would want the output PCollection to have around 1000 distinct keys. I feel like the problem is that to obtain the input PCollection I used GroupByKey, which effectively defined that "bundles" correspond to keys. – Dzmitry Lazerka Jul 20 '18 at 16:56
  • To get a better view of what is happening, I recommend creating logs using org.slf4j.LoggerFactory: https://cloud.google.com/dataflow/docs/guides/logging#worker_log_message_code_example. I would log the batch IDs to make sure they are generated correctly. – Nathan Nasser Jan 15 '19 at 02:09
