I have a lot of JSON files in a bucket. Using Python 3, I want to get each file's name, build a key-value pair of file name and file contents, and read the files. I believe MatchFiles now works for Python, but I was wondering how I would implement this:
files = p | fileio.MatchFiles("gs://mybucket/*.json")
# | Ideally I want to create a tuple of (filename, JSON row) here, which I will pass into a ParDo with a custom class that parses the JSON
The goal: let's say I had 10 files in a bucket:
gs://mybucket/myfile1.json
gs://mybucket/myfile2.json
And the json files in the bucket all share the same structure
I pass each file into the custom ParseFile class (I think via ParDo; my Apache Beam knowledge is limited). For each row in the JSON, I output a dictionary in which one of the keys is the filename, and I will save those dictionaries as newline-delimited JSON.
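To make the goal concrete, here is a rough sketch of the per-file logic I have in mind, written as plain Python so it can be run without Beam. It assumes each file contains a JSON array of objects (my real files may differ), and the names `parse_file` and `to_ndjson_line` are placeholders I made up for illustration. In the pipeline, I imagine the body of `parse_file` would live in the `process` method of the ParseFile DoFn.

```python
import json


def parse_file(element):
    """Yield one dict per JSON row, each tagged with the source filename.

    `element` is a (filename, file_text) tuple.
    Assumption: the file text is a JSON array of objects.
    """
    filename, text = element
    rows = json.loads(text)
    for row in rows:
        record = dict(row)              # copy so the parsed row is not mutated
        record["filename"] = filename   # add the filename as one of the keys
        yield record


def to_ndjson_line(record):
    """Serialize one record as a single line of newline-delimited JSON."""
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    sample = ("gs://mybucket/myfile1.json", '[{"a": 1}, {"a": 2}]')
    for rec in parse_file(sample):
        print(to_ndjson_line(rec))
```

Each printed line is one output record; writing those lines to a file would give the newline-delimited JSON I am after.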
Edit 9/24 11:15 am PST: here is what I tried:
file_content_pairs = (p
    | fileio.MatchFiles(known_args.input_bucket)
    | fileio.ReadMatches()
    | beam.Map(lambda file: (file.metadata.path, json.loads(file.read_utf8())))
    | beam.ParDo(TestThis())
)
TestThis() is just supposed to print the content:
class TestThis(beam.DoFn):
    def process(self, element):
        print(element)
        print("stop")
        yield element
But all I see in my output is: INFO:root:Finished listing 2 files in 1.2762866020202637 seconds.