What would be the simplest way to process all the records that were mapped to a specific key and output multiple records for that data?

For example (a synthetic one), assume my key is a date and the values are intra-day timestamps with measured temperatures. I'd like to classify the temperatures as high/average/low within the day (for instance, more than 1 stddev above or below the day's average).

The output would be the original temperatures with their new classifications.

Combine.PerKey(CombineFn) allows only one output per key, produced by the #extractOutput() method.

Thanks

G B

2 Answers

CombineFns are restricted to a single output value because that allows the system to do additional parallelization: combining different subsets of the values separately, and then combining their intermediate results in an arbitrary tree reduction pattern, until a single result value is produced for each key.
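To illustrate why a single output per key goes hand in hand with tree reduction, here is a minimal plain-Java sketch of a mergeable accumulator for the temperature example. It does not use the Dataflow SDK; the class name `TempStatsAccumulator` and its fields are hypothetical, and the method names only mirror the CombineFn lifecycle loosely.

```java
/** Hypothetical accumulator; mirrors the CombineFn lifecycle but uses no Dataflow classes. */
final class TempStatsAccumulator {
    long count;
    double sum;
    double sumOfSquares;

    /** Fold one temperature reading into this partial result. */
    TempStatsAccumulator addInput(double temperature) {
        count++;
        sum += temperature;
        sumOfSquares += temperature * temperature;
        return this;
    }

    /** Combine two partial results; the grouping of inputs does not affect the totals. */
    static TempStatsAccumulator merge(TempStatsAccumulator a, TempStatsAccumulator b) {
        TempStatsAccumulator merged = new TempStatsAccumulator();
        merged.count = a.count + b.count;
        merged.sum = a.sum + b.sum;
        merged.sumOfSquares = a.sumOfSquares + b.sumOfSquares;
        return merged;
    }

    /** The single output per key: {mean, population standard deviation}. */
    double[] extractOutput() {
        if (count == 0) {
            return new double[] {Double.NaN, Double.NaN};
        }
        double mean = sum / count;
        // Clamp at zero to absorb tiny negative values from floating-point rounding.
        double variance = Math.max(0.0, sumOfSquares / count - mean * mean);
        return new double[] {mean, Math.sqrt(variance)};
    }
}
```

Because `merge` only adds up counts and sums, partial accumulators built on different workers can be combined in any order, which is what makes the per-key result a single value.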

If your values per key don't fit in memory (so you can't use the GroupByKey-ParDo pattern that Jeremy suggests) but the computed statistics do fit in memory, you could also do something like this:

(1) Use Combine.perKey() to calculate the stats per day.
(2) Use View.asIterable() to convert those into PCollectionViews.
(3) Reprocess the original input with a ParDo that takes the statistics as side inputs.
(4) In that ParDo's DoFn, have startBundle() take the side inputs and build up an in-memory data structure mapping days to statistics that can be used to do lookups in processElement.
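The lookup part of this approach (steps 3 and 4) can be sketched in plain Java as follows. This is a simulation of the idea rather than Dataflow SDK code: `SideInputSketch`, `Stats`, `buildLookup`, and `classify` are hypothetical names, and the day is assumed to be an Integer key.

```java
import java.util.HashMap;
import java.util.Map;

final class SideInputSketch {
    /** Per-day statistics, i.e. the single combined output for one key. */
    static final class Stats {
        final double mean;
        final double stddev;

        Stats(double mean, double stddev) {
            this.mean = mean;
            this.stddev = stddev;
        }
    }

    /** Step 4: turn the iterable of per-day stats into a map once (the startBundle() role). */
    static Map<Integer, Stats> buildLookup(Iterable<Map.Entry<Integer, Stats>> sideInput) {
        Map<Integer, Stats> lookup = new HashMap<>();
        for (Map.Entry<Integer, Stats> entry : sideInput) {
            lookup.put(entry.getKey(), entry.getValue());
        }
        return lookup;
    }

    /** Step 3: handle one original record using only the lookup (the processElement() role). */
    static String classify(int day, double temperature, Map<Integer, Stats> lookup) {
        Stats stats = lookup.get(day);
        if (stats == null) {
            return "unknown"; // no statistics were computed for this day
        }
        if (temperature > stats.mean + stats.stddev) {
            return "high";
        }
        if (temperature < stats.mean - stats.stddev) {
            return "low";
        }
        return "average";
    }
}
```

Each record is classified independently, so the full set of readings for a day never has to be held in memory; only the map of statistics does.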

Frances
  • Thanks, this is what I ended up doing as I also wanted to have the stats as a separate output. – G B Dec 29 '14 at 08:15

Why not use a GroupByKey operation followed by a ParDo? The GroupBy would group all the values with a given key. Applying a ParDo then allows you to process all the values with a given key. Using a ParDo you can output multiple values for a given key.

In your temperature example, the output of the GroupByKey would be a PCollection of KV<Integer, Iterable<Float>> (I'm assuming you use an Integer to represent the day and a Float for the temperature). You could then apply a ParDo to process each of these KVs. For each KV you could iterate over the Floats representing the temperatures and compute the high/average/low thresholds for that day. You could then classify each temperature reading using those stats and output a record representing the classification. This assumes the number of measurements for each day is small enough to fit easily in memory.
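The per-group work that such a ParDo would do can be sketched in plain Java. This is only an illustration of the logic, not Dataflow SDK code: `GroupedDayClassifier` and `classifyDay` are hypothetical names, the 1-stddev thresholds come from the question, and the sketch assumes the grouped Iterable can be iterated twice.

```java
import java.util.ArrayList;
import java.util.List;

final class GroupedDayClassifier {
    /** Returns one "temperature,label" string per reading in the day's group. */
    static List<String> classifyDay(Iterable<Float> temperatures) {
        // First pass: mean and population standard deviation for the day.
        long n = 0;
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (float t : temperatures) {
            n++;
            sum += t;
            sumOfSquares += (double) t * t;
        }
        List<String> out = new ArrayList<>();
        if (n == 0) {
            return out;
        }
        double mean = sum / n;
        double stddev = Math.sqrt(Math.max(0.0, sumOfSquares / n - mean * mean));

        // Second pass: emit one classified record per input reading.
        for (float t : temperatures) {
            String label;
            if (t > mean + stddev) {
                label = "high";
            } else if (t < mean - stddev) {
                label = "low";
            } else {
                label = "average";
            }
            out.add(t + "," + label);
        }
        return out;
    }
}
```

In a real pipeline the second loop would emit each record to the output collection instead of collecting strings in a list.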

Jeremy Lewi