I'm using beam.io.ReadFromText to process data from textual files.

Parsing the files is more complex than reading by lines (there is some state that needs to be carried and changed from line to line).

Can I make Beam read my file with only one processor (not parallelized)? Is there any other best practice for cases like this?

Zach Moshe

1 Answer

Yes, you are free to do arbitrary processing of files yourself, using the FileSystems API. This is what ReadFromText and all other file-based built-in transforms do under the hood.

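import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

def ParseFile(name):
  # One call handles the whole file, so parser state can be carried across lines.
  with FileSystems.open(name) as f:
    ...  # Parse the file and yield elements

# p is an existing beam.Pipeline; the parentheses allow the multi-line expression.
(p | beam.Create(['/path/to/file'])
   | beam.FlatMap(ParseFile))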
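
As an illustration of the "state carried from line to line" part, here is a minimal sketch of a stateful parser written as a plain generator. The file format (a `[section]` header followed by `key=value` lines) and the name `parse_lines` are hypothetical, since the question does not describe the real format. The sketch also assumes the handle returned by FileSystems.open can be iterated line by line like an ordinary binary file; here an in-memory io.BytesIO stands in for it.

```python
import io

def parse_lines(lines):
    """Yield (section, key, value) tuples for a hypothetical sectioned format."""
    section = None  # state carried from one line to the next
    for raw in lines:
        # Binary file handles yield bytes, so decode before parsing.
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]  # a header line updates the state
            continue
        key, _, value = line.partition("=")
        yield (section, key.strip(), value.strip())

# Stand-in for an opened file: iterating a BytesIO yields lines of bytes.
sample = io.BytesIO(b"[a]\nx=1\ny=2\n\n[b]\nz=3\n")
records = list(parse_lines(sample))
```

Inside ParseFile, the same generator could be used as `yield from parse_lines(f)`. Because each file name is a single element, one worker processes a given file from start to end, while different files can still be parsed in parallel.
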
jkff