
I am reading a JSON file in a Dataflow pipeline using beam.io.ReadFromText. When I pass its output to a DoFn class (via beam.ParDo), each line becomes an element. I want to use the JSON file's content in my class. How do I do this?

Content of the JSON file:

{"query": "select * from tablename", "Unit": "XX", "outputFileLocation": "gs://test-bucket/data.csv", "location": "US"}

Here I want to use each of its values (query, Unit, location and outputFileLocation) in class Query():

p | beam.io.ReadFromText(file_pattern=user_options.inputFile) | 'Executing Query' >> beam.ParDo(Query())

My class:

class Query(beam.DoFn):
    def process(self, element):
        # do something using the content available in element
        ...
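Since the sample file is a single-line JSON document, each element handed to the DoFn is that entire line, so it can be parsed directly. A minimal sketch of the parsing step (plain Python for clarity; in the pipeline the same json.loads call would go inside Query.process):

```python
import json

def parse_config(element):
    # element is one line as read by beam.io.ReadFromText
    config = json.loads(element)
    return (config["query"], config["Unit"],
            config["location"], config["outputFileLocation"])

line = ('{"query": "select * from tablename", "Unit": "XX", '
        '"outputFileLocation": "gs://test-bucket/data.csv", "location": "US"}')
query, unit, location, output_path = parse_config(line)
```

Inside the DoFn this would be `config = json.loads(element)` at the top of process(), after which the individual fields are ordinary dictionary lookups.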
Kaustubh Ghole

1 Answer


I don't think this is possible with the current set of IOs. The reason is that a multiline JSON document requires parsing the complete file to identify a single JSON block, which would only work if reading were not parallelized. Since file-based IOs run on multiple workers in parallel, splitting files using a partitioning logic and a line delimiter, parsing multiline JSON this way is not possible.

If you have multiple smaller files, you can instead read each file separately, as a whole, and emit the parsed JSON. You can then use a reshuffle to evenly distribute the data for the downstream operations.

The pipeline would look something like this:

Get File List -> Reshuffle -> Read content of individual files and emit the parsed json -> Reshuffle -> Do things.
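The per-file pattern above can be sketched outside Beam with plain Python (a hypothetical local example using glob and temporary files; the names read_json_configs and cfg*.json are made up for illustration). Each file is read whole, so even multiline JSON parses correctly:

```python
import glob
import json
import os
import tempfile

def read_json_configs(pattern):
    # step 1: get the file list
    for path in glob.glob(pattern):
        # step 2: read the whole file, then emit the parsed JSON
        with open(path) as f:
            yield json.loads(f.read())

# demo: write two small config files, then read them back
tmp = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmp, "cfg%d.json" % i), "w") as f:
        json.dump({"Unit": "U%d" % i}, f)

configs = sorted(read_json_configs(os.path.join(tmp, "*.json")),
                 key=lambda c: c["Unit"])
```

In Beam itself, the equivalent steps would be built from the fileio transforms: beam.io.fileio.MatchFiles(pattern) | beam.io.fileio.ReadMatches() | beam.Reshuffle() | beam.Map(lambda f: json.loads(f.read_utf8())) | beam.Reshuffle(), so each worker parses complete files rather than individual lines.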
Ankur