How do I output to multiple files from PCollection<KV<String, String>>?

The key in each entry is the file name. The GroupByKey transform gives me a PCollection<KV<String, Iterable<String>>>, but how can I write each group to its own file?

For example, given the following input

<file1, value1>
<file2, value2>
<file1, value3>

I'd like to output two files

file1:
  value1
  value3

file2:
  value2
mlwei
  • This is available now, via TextIO.write().to(DynamicDestinations). See https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L571 – jkff Aug 02 '17 at 18:47

1 Answer

Dataflow currently does not have a transform that can do this for you. As a workaround, you can use a simple DoFn that extracts the file name from the KV, opens the file using IOChannelFactory, and writes the Iterable<String> to it.
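
To make the workaround concrete, here is a minimal plain-Java sketch of the per-key write that the body of such a DoFn would perform. It is only an illustration under assumptions: it uses java.nio.file in place of IOChannelFactory, it omits the Dataflow pipeline wiring entirely, and the class and method names (PerKeyFileWriter, groupByKey, writeGroup) are hypothetical. The groupByKey helper merely stands in for the GroupByKey transform so the example runs on its own.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of the Dataflow or Beam SDK.
public class PerKeyFileWriter {

    // Local stand-in for GroupByKey: collects values under their file-name key.
    public static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> entries) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : entries) {
            grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        return grouped;
    }

    // What the DoFn body would do for one KV<String, Iterable<String>>:
    // open the file named by the key and write one value per line.
    // newBufferedWriter truncates an existing file, so a retry overwrites it.
    public static void writeGroup(Path outputDir, String fileName, Iterable<String> values)
            throws IOException {
        Path target = outputDir.resolve(fileName);
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String value : values) {
                writer.write(value);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("per-key-output");
        // The example input from the question.
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("file1", "value1"),
                Map.entry("file2", "value2"),
                Map.entry("file1", "value3"));
        for (Map.Entry<String, List<String>> group : groupByKey(input).entrySet()) {
            writeGroup(outputDir, group.getKey(), group.getValue());
        }
        System.out.println("Wrote files under " + outputDir);
    }
}
```

In a real pipeline the writeGroup logic would sit inside the DoFn's element-processing method. Note that the values for a key arrive in no guaranteed order after GroupByKey, and a bundle may be retried, which is why overwriting the target file rather than appending to it is the safer choice here.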

See similar question and another one.

We plan to address this (see https://issues.apache.org/jira/browse/BEAM-92), but there is no concrete timeline yet.

jkff