I have a streaming pipeline that reads from Pub/Sub, where each message is the filename of a GCS file. From there I want to read each file and parse out the events on each line (the events are what I ultimately want to process).

Can I use TextIO for this? Specifically, can it be used in a streaming pipeline when the filename is only known during execution (as opposed to using TextIO as a source, where the filenames are known at construction)? If not, I'm thinking of doing something like the following:

  • Get the topic from pub/sub
  • ParDo to read each file and get the lines
  • Process the lines of the file...

Could I use the FileBasedReader or something similar in this case to read the files? The files aren't too big so I wouldn't need to parallelize the reading of a single file, but I would need to read a lot of files.
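The per-file step of the ParDo idea above boils down to opening a channel for the file and emitting one element per line. Below is a minimal sketch of just that inner loop in plain Java, with no Beam dependency. `LineEmitter` and `emitLines` are hypothetical names made up for illustration; inside a real DoFn the channel would presumably come from Beam's `FileSystems.open(...)` and the consumer would be the DoFn's output, which is an assumption about how this would be wired up rather than something shown here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Beam: reads one file's channel line by line.
class LineEmitter {

    // Passes every line from the channel to `out` and returns the line count.
    // The channel is closed when reading finishes or fails.
    static long emitLines(ReadableByteChannel channel, Consumer<String> out) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(
                Channels.newReader(channel, StandardCharsets.UTF_8.name()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.accept(line); // in a DoFn this would be the output call
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a GCS file: an in-memory channel with two event lines.
        byte[] data = "event-1\nevent-2\n".getBytes(StandardCharsets.UTF_8);
        List<String> events = new ArrayList<>();
        long n = emitLines(Channels.newChannel(new ByteArrayInputStream(data)), events::add);
        System.out.println(n + " lines: " + events);
    }
}
```

Since the files are small, reading each one sequentially like this inside a single element's processing should be fine; parallelism would then come from having many filenames rather than from splitting a file.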

  • We're close to having sufficient API support to create an efficient implementation of this. Please follow https://issues.apache.org/jira/browse/BEAM-2511 ("TextIO should support reading a PCollection of filenames"). – jkff Jun 24 '17 at 18:01
  • I edited my answer to reflect the new API. – jkff Jul 12 '17 at 04:01

1 Answer

You can use the TextIO.readAll() transform, which was recently added to Beam in #3443. For example:

PCollection<String> filenames = p.apply(PubsubIO.readStrings()...);
PCollection<String> lines = filenames.apply(TextIO.readAll());

This will read all lines of each file whose name arrives over Pub/Sub.

jkff