In an Apache Beam pipeline running on Google Cloud Dataflow, one can write data to a text file. The Apache Beam website mentions the possibility of writing "multiple output files" here, but the code shown is the same as for a single aggregated output (and that is all I get).

Is it possible to generate a file for each item in a PCollection?

Bastiaan
  • You can definitely do this by adding a window function to your TextIO write. I applied an interval window function on the logtime field of my log, and this now gives me one log file for each separate log. – Jack Sep 26 '17 at 07:09
  • I am a Java coder, so I did this in Java. – Jack Sep 26 '17 at 07:11
  • Pasting the code in case it helps: "pushFile.apply(Window.into(new FileTextIOWindowFn())) .apply("FileTO to LOG TextIO", ParDo.of(new TextIOWriteDoFn())) .apply(TextIO.write().to(pipelineOptions.getFileStorageBucket()).withWindowedWrites() .withFilenamePolicy(new FileStorageFileNamePolicy(logTypeEnum)).withNumShards(10));" – Jack Sep 26 '17 at 07:11

1 Answer

You can do this yourself by applying a ParDo to the PCollection that takes each element and writes it to a file using the FileSystems API. The Java version of the API is here, the Python version is here; in the Java version, you'll need to use FileSystems.create(), which returns a writable channel (FileSystems.open() is for reading).

Note that your pipeline will likely be vulnerable to worker failures: if a worker fails and its work is retried, you may be left with garbage files from the failed attempts.
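One possible mitigation, which is an assumption on my part and not something the answer proposes, is to derive each file name deterministically from the element, so that a retried attempt overwrites the file from the failed attempt instead of adding a new one. The function name `deterministic_filename`, the prefix and the 16-character digest length are hypothetical choices.

```python
import hashlib


def deterministic_filename(element_text, prefix='item'):
    """Returns a file name that depends only on the element's contents.

    Hypothetical helper: the same element always maps to the same name,
    so a retried write targets the same file rather than a fresh one.
    """
    digest = hashlib.sha256(element_text.encode('utf-8')).hexdigest()
    # 16 hex characters keep names short; use more if collisions matter.
    return '{}-{}.txt'.format(prefix, digest[:16])
```

This only helps when a retry actually rewrites the same element; a file left half-written by an attempt that is never retried would still need separate cleanup.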

For a more general solution, you'll need to wait for http://s.apache.org/fileio-write, which is currently being implemented and will be released in Beam Java 2.2.

jkff
  • With tf 1.2 installed this seems to work; with tf 1.3 installed, however, the beam module is broken. Running `from apache_beam.io.filesystems import FileSystems` gives "TypeError: Error when calling the metaclass bases", and rerunning the same line gives "ImportError: cannot import name coders". Omitting the s in FileSystems gives the same result. – Bastiaan Sep 20 '17 at 18:52
  • Never mind, the answer is here: https://stackoverflow.com/questions/46300173/import-apache-beam-metaclass-conflict – Bastiaan Sep 20 '17 at 19:29