
I have a lot of JSON files in a bucket, and using Python 3 I want to get each file name, create a key-value pair of (filename, contents), and read them. MatchFiles now works for Python, I believe, but I was wondering how I would implement this:

files = (p
         | fileio.MatchFiles("gs://mybuckenumerate/*.json")
         # Ideally I want to create a tuple of (filename, json row), which I will
         # pass into a ParDo that is a custom class that parses the json
         )

Goal is let's say I had 10 files in a bucket:

gs://mybucket/myfile1.json
gs://mybucket/myfile2.json

And the json files in the bucket all share the same structure

I pass it into the custom ParseFile class (I think via ParDo; my Apache Beam knowledge is limited), and for each row in the JSON I output a dictionary (which I will save as newline-delimited JSON) where one of the keys is the filename.
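Something like this is what I have in mind for that ParDo (just a sketch, untested; ParseFile and the row handling are placeholders, and it assumes the element is already a (filename, rows) tuple):

class ParseFile(beam.DoFn):

    def process(self, element):
        filename, rows = element
        # Assuming each file parses to a list of row-like dicts
        for row in rows:
            out = dict(row)
            out["filename"] = filename
            yield out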

Edit 9/24, 11:15 am PST: here is what I tried:

file_content_pairs = (p 
                | fileio.MatchFiles(known_args.input_bucket)
                | fileio.ReadMatches()
                | beam.Map(lambda file: (file.metadata.path, json.loads(file.read_utf8())))
                | beam.ParDo(TestThis())
                )

TestThis() is just supposed to print the content:

class TestThis(beam.DoFn):

    def process(self, element):
        print(element)
        print("stop")
        yield element

But all that I am seeing in my output is: INFO:root:Finished listing 2 files in 1.2762866020202637 seconds.

WIT

1 Answer


I did not understand. Do you want to have key-value pairs of (filename, json-parsed-contents)?

If so, you would:

file_content_pairs = (
  p | fileio.MatchFiles("gs://mybucketname/*.json")
    | fileio.ReadMatches()
    | beam.Map(lambda file: (file.metadata.path, json.loads(file.read_utf8())))
)

So, if your file looks like this:

==============myfile.json===============
{"a": "b",
 "c": "d",
 "e": 1}

Then, your file_content_pairs collection will contain the key-value pair ("myfile.json", {"a":"b", "c": "d", "e": 1}).
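If you then want the newline-delimited JSON output described in the question, one way (just a sketch continuing from file_content_pairs; the output path and WriteToText options here are assumptions) is to fold the filename into each record and serialize it:

def to_json_line(kv):
  path, record = kv
  # add the source filename as a key on each record before serializing
  return json.dumps(dict(record, filename=path))

(file_content_pairs
 | beam.Map(to_json_line)
 | beam.io.WriteToText("gs://mybucketname/output/records",
                       file_name_suffix=".json"))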


If your file is in json lines format, you would do:

def consume_file(f):
  # query_bigquery is the user's own function that maps the file path to some
  # other name; it is called once per file, not once per row.
  other_name = query_bigquery(f.metadata.path)
  return [(other_name, json.loads(line))
          for line in f.read_utf8().strip().split('\n')]

with beam.Pipeline() as p:
  result = (p
            | fileio.MatchFiles("gs://mybucketname/*.json")
            | fileio.ReadMatches()
            | beam.FlatMap(consume_file))
Pablo
  • That's exactly what I'm looking for. I tried this and then added one more step after to parse the results (currently it just prints them), and it doesn't seem to be working – WIT Sep 24 '19 at 18:16
  • file_content_pairs = (p | fileio.MatchFiles(known_args.input_bucket) | fileio.ReadMatches() | beam.Map(lambda file: (file.metadata.path, json.loads(file.read_utf8()))) | beam.ParDo(TestThis()) ) – WIT Sep 24 '19 at 18:16
  • Hey Pablo, this answer only works if it is a json file, but not if it is newline-delimited and I want to load each file row so it can be parallelized – WIT Sep 27 '19 at 14:57
  • for example ==============myfile.json=============== {"a": "b", "c": "d","e": 1}, \n {"a": "b", "c": "d","e": 1} – WIT Sep 27 '19 at 14:58
  • Do you have any ideas for that scenario? It is ok if it has to be a different load line as i can do an if statement – WIT Sep 27 '19 at 14:58
  • I've added a section with that – Pablo Sep 27 '19 at 16:28
  • Does this solution preserve getting the file name? – WIT Sep 27 '19 at 16:32
  • oops. now it does. – Pablo Sep 27 '19 at 16:36
  • Sorry to add one more caveat - I apologize. I was calling a function on f.metadata.path which queries BigQuery and maps the file name to something. I didn't want to have to query BQ for every row in the file; is it possible to read the file in the transform after the f.metadata.path step so I just invoke this function once per file? – WIT Sep 27 '19 at 16:56
  • Hmm, it's not even working; it doesn't even go into the consume_file function, and then it just says Finished listing x files in ... seconds – WIT Sep 27 '19 at 17:29
  • Do you think there's a way to accomplish this by tagging a pcollection? (i havent done that, but was reading about it) – WIT Sep 27 '19 at 17:31
  • I don't know what you mean by 'tagging a pcollection'. As for why it doesn't work -- I do not know, but you should be able to start from the first example and develop a function that queries BQ and returns the lines in the file – Pablo Sep 27 '19 at 17:40
  • Yeah, it doesn't work, and Dataflow is super difficult to debug, so it's hard to see why it isn't invoking the function – WIT Sep 27 '19 at 17:43
  • you can run it locally and see what's going on. note that the current implementation loads the whole file into memory. That may be an issue. – Pablo Sep 27 '19 at 17:45
  • Ran it locally; seems like it's working for json files but not jsonl. Thanks for your help! I will debug – WIT Sep 27 '19 at 17:46
  • Haha yeah I also made that edit before seeing this. Thank you! Will update you – WIT Sep 27 '19 at 18:00
  • Is there a way in the FlatMap(consume_file) to yield each row as I parse it? When there are a lot of rows, the Dataflow job gets hung up on this step. I'm not sure if that would be fixed if I increase num_workers, or if it's a bottleneck because I am waiting for every line in the json to be loaded, so as it stands it's not something that can be divided – WIT Oct 02 '19 at 17:42
  • Instead of using `file.read_utf8()`, which reads the whole file, you can use `file.open()`, which returns an iterable that returns each line one by one. – Pablo Oct 02 '19 at 19:33
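Following up on that last comment, here is a sketch of the line-by-line variant (an assumption on my part: fileio.ReadableFile.open() returns a file-like object whose iteration yields raw byte lines, so each line is decoded before json.loads; query_bigquery is the user's own mapping function from the answer):

import json
import apache_beam as beam
from apache_beam.io import fileio

def consume_file_streaming(f):
  # call the per-file lookup once, then stream lines instead of reading
  # the whole file into memory with read_utf8()
  other_name = query_bigquery(f.metadata.path)  # user's own mapping function
  with f.open() as handle:
    for line in handle:
      line = line.decode('utf-8').strip()
      if line:
        yield (other_name, json.loads(line))

with beam.Pipeline() as p:
  result = (p
            | fileio.MatchFiles("gs://mybucketname/*.json")
            | fileio.ReadMatches()
            | beam.FlatMap(consume_file_streaming))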