
I am trying to read records from a big file in cloud storage and shard them according to a given field.

I'm planning to do Read | Map(lambda x: (x[key_field], x)) | GroupByKey | Write, with each group written to a file named after its key field.
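In rough sketch form (the key field position and the paths below are just placeholders):

import apache_beam as beam

with beam.Pipeline() as p:
    grouped = (p
               | 'Read' >> beam.io.ReadFromText('gs://my-bucket/big-file.csv')
               | 'KeyByField' >> beam.Map(lambda line: (line.split(',')[0], line))
               | 'GroupByKey' >> beam.GroupByKey())
    # ...and then write each group to a file named after its key, which is the part I'm missing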

However, I couldn't find a way to write to dynamically named files in cloud storage. Is this functionality supported?

Thank you, Yiqing

yiqing_hua

2 Answers


Yes, you can use the FileSystems API to create the files.
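For instance, here is a rough sketch of a DoFn that creates one file per key from the GroupByKey output (the output directory, the .txt suffix, and the record format are assumptions, not part of any fixed API):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class WritePerKeyFn(beam.DoFn):
    """Writes each (key, records) group to its own file under output_dir."""

    def __init__(self, output_dir):
        self._output_dir = output_dir

    def process(self, element):
        key, records = element
        # Derive the file name from the key.
        path = FileSystems.join(self._output_dir, '%s.txt' % key)
        writer = FileSystems.create(path)
        try:
            for record in records:
                writer.write(('%s\n' % record).encode('utf-8'))
        finally:
            writer.close()
        # Emit the written path so it can be logged or inspected downstream.
        yield path

You would apply it after the grouping, e.g. ... | beam.GroupByKey() | beam.ParDo(WritePerKeyFn('gs://your-bucket/output')).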

jkff
    Thank you! I was using a FileSystems.create handle inside a ParDo to write the grouped results. However, it seems that GroupByKey waits for all the data to be read in before the writing to a single file starts. So I have two follow-up questions: 1) Can I use wildcards with the FileSystems API? 2) Is there a way for GroupByKey not to have to wait for all the data? Otherwise there might be a memory issue. Thanks again! – yiqing_hua Feb 22 '18 at 19:50

An experimental write transform, beam.io.fileio.WriteToFiles, was added to the Beam Python SDK in 2.14.0:

my_pcollection | beam.io.fileio.WriteToFiles(
      path='/my/file/path',
      # The destination callable picks a logical destination per record.
      destination=lambda record: 'avro' if record['type'] == 'A' else 'csv',
      # AvroSink and CsvSink stand for user-defined beam.io.fileio.FileSink implementations.
      sink=lambda dest: AvroSink() if dest == 'avro' else CsvSink(),
      file_naming=beam.io.fileio.destination_prefix_naming())

which can be used to write different records to different files.

You can skip the GroupByKey and just use destination to decide which file each record is written to. The return value of destination needs to be a value that can be grouped by.
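For example, a minimal sketch that shards CSV lines by their first field without a GroupByKey (the bucket paths and the CSV layout are assumptions; TextSink simply writes each string element as one line):

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/big-file.csv')
     | 'Shard' >> fileio.WriteToFiles(
           path='gs://my-bucket/output/',
           # The first CSV field decides which file a line goes to.
           destination=lambda line: line.split(',')[0],
           sink=lambda dest: fileio.TextSink(),
           file_naming=fileio.destination_prefix_naming()))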

More documentation here:

https://beam.apache.org/releases/pydoc/2.14.0/apache_beam.io.fileio.html#dynamic-destinations

And the JIRA issue here:

https://issues.apache.org/jira/browse/BEAM-2857

anrope