
I am trying to read and apply some subsetting to multiple files in GCP with Apache Beam. I prepared two pipelines that work for a single file but fail when I try them on multiple files. Apart from this, it would be handy to combine my pipelines into one if possible, or to orchestrate them so that they run in order. Right now the pipelines work locally, but my ultimate goal is to run them with Dataflow.

I tried textio.ReadFromText and textio.ReadAllFromText, but I couldn't make either of them work for multiple files.

import json

def toJson(file):
    # Read the whole file and parse it as a single JSON document.
    with open(file) as f:
        return json.load(f)


with beam.Pipeline(options=PipelineOptions()) as p:
    files = (p
        | beam.io.textio.ReadFromText("gs://my_bucket/file1.txt.gz", skip_header_lines=0)
        | beam.io.WriteToText("/home/test",
                   file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))

with beam.Pipeline(options=PipelineOptions()) as p:
    lines = (p
        | 'read_data' >> beam.Create(['test-00000-of-00001.json'])
        | "toJson" >> beam.Map(toJson)
        | "takeItems" >> beam.FlatMap(lambda line: line["Items"])
        | "takeSubjects" >> beam.FlatMap(lambda line: line['data']['subjects'])
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
                   file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))

These two pipelines work well for a single file, but I have hundreds of files in the same format and would like to take advantage of parallel computing.

Is there a way to make this pipeline work for multiple files under the same directory?

Is it possible to do this within a single pipeline instead of creating two different pipelines? (It is not handy to write files from the bucket to the worker nodes.)

  • Is there some metadata in the filename you are trying to retain through to the filename you are writing out to? textio supports glob pattern and it can deal with compressed types directly. – Reza Rokni Nov 12 '19 at 01:47
  • @RezaRokni, thanks for your comment. Could you give an example for this use case? I don't understand. There is no metadata in it. – Jonsi Billups Nov 12 '19 at 09:49
  • I may not have understood your use case, but you can use a glob pattern with your beam.io.textio.ReadFromText("gs://my_bucket/*.txt") and then use your beam.Map(toJson). – Reza Rokni Nov 12 '19 at 12:32
  • Beam complains that it expects num_bytes in read, and when I provide num_bytes in read it raises a JsonDecode error. – Jonsi Billups Nov 12 '19 at 14:36
  • Ahh sorry, I had skimmed your example and missed that you're trying to read in the whole file, rather than read lines from it. Have you already tried using fileio instead of textio? textio reads lines from a file delimited by the newline character; fileio produces a PCollection of records representing the file and its metadata. – Reza Rokni Nov 12 '19 at 15:59
  • This is still a problem. I can’t seem to process more than one file in parallel in the same pipeline. – PANDA Stack Jul 04 '21 at 00:35
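
A minimal sketch of the fileio approach suggested in the comments above, assuming each .txt.gz file holds a single JSON document in the same Items / data / subjects layout as the question (the bucket paths and the output prefix are placeholders):

import json
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    counts = (p
        # Expand the glob into one element per matching file.
        | "MatchFiles" >> fileio.MatchFiles("gs://my_bucket/file*.txt.gz")
        | "ReadMatches" >> fileio.ReadMatches()
        # Each element is a ReadableFile; read the whole file and parse it as JSON.
        | "toJson" >> beam.Map(lambda f: json.loads(f.read_utf8()))
        | "takeItems" >> beam.FlatMap(lambda line: line["Items"])
        | "takeSubjects" >> beam.FlatMap(lambda line: line['data']['subjects'])
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("gs://my_bucket/output/items",
                   file_name_suffix=".txt", num_shards=1))

Because the files are matched and read directly from the bucket, nothing has to be written to the worker nodes in between.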

1 Answer


I worked out how to make it run for multiple files, but I couldn't make it work within a single pipeline. I used a for loop and then the beam.Flatten option.

Here is my solution:

file_list = ["gs://my_bucket/file*.txt.gz"]
res_list = ["/home/subject_test_{}-00000-of-00001.json".format(i) for i in range(len(file_list))]

with beam.Pipeline(options=PipelineOptions()) as p:
    for i, file in enumerate(file_list):
        (p
         | "Read Text {}".format(i) >> beam.io.textio.ReadFromText(file, skip_header_lines=0)
         | "Write Text {}".format(i) >> beam.io.WriteToText("/home/subject_test_{}".format(i),
                    file_name_suffix=".json", num_shards=1, append_trailing_newlines=True))

pcols = []
with beam.Pipeline(options=PipelineOptions()) as p:
    for i, res in enumerate(res_list):
        pcol = (p | 'read_data_{}'.format(i) >> beam.Create([res])
                  | "toJson_{}".format(i) >> beam.Map(toJson)
                  | "takeItems_{}".format(i) >> beam.FlatMap(lambda line: line["Items"])
                  | "takeSubjects_{}".format(i) >> beam.FlatMap(lambda line: line['data']['subjects']))
        pcols.append(pcol)
    out = (pcols
        | beam.Flatten()
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
                   file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))
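
Since the mapping steps are identical for every file, the second pipeline could likely be written without the per-file branches and the Flatten at all; a minimal variant, reusing the toJson helper and res_list from above (beam.Create accepts the whole list of result files):

with beam.Pipeline(options=PipelineOptions()) as p:
    out = (p
        | 'read_data' >> beam.Create(res_list)
        | "toJson" >> beam.Map(toJson)
        | "takeItems" >> beam.FlatMap(lambda line: line["Items"])
        | "takeSubjects" >> beam.FlatMap(lambda line: line['data']['subjects'])
        | beam.combiners.Count.PerElement()
        | beam.io.WriteToText("/home/items",
                   file_name_suffix=".txt", num_shards=1, append_trailing_newlines=True))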
  • Wouldn't it be easier to use [`ReadAllFromText`](https://github.com/apache/beam/blob/v2.16.0/sdks/python/apache_beam/io/textio.py#L422) instead and pass a PCollection as [input](https://stackoverflow.com/a/55011224/6121516)? – Guillem Xercavins Nov 11 '19 at 19:48
  • @GuillemXercavins, thanks for your comment, but I couldn't make it work with the toJson function. I am still trying to do all of this within a single pipeline instead of two. To do that I tried apache_beam.io.gcp.gcsfilesystem in the "toJson" function, but read() failed saying that I have to give "num_bytes"; this doesn't happen when I put JSON in it. – Jonsi Billups Nov 12 '19 at 09:47
  • @JonsiBillups how do you trigger your pipeline in this case? I am doing something similar, but when I do pipeline.run() it goes into an infinite loop for some reason. – Amruta Deshmukh Jul 21 '20 at 22:44
  • @AmrutaDeshmukh, I trigger the run via Dataflow runner on GCP, see below: [link](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#python) – Jonsi Billups Jul 22 '20 at 09:13
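
For reference, a minimal sketch of the execution parameters that last comment points to, assuming hypothetical project, region, bucket, and job names; the same pipelines are then run with these options instead of the default local runner:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # placeholder project id
    region="us-central1",                # placeholder region
    temp_location="gs://my_bucket/tmp",  # placeholder temp/staging location
    job_name="subjects-count")           # placeholder job name

with beam.Pipeline(options=options) as p:
    ...  # same transforms as in the pipelines above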