I am very new to programming and Apache Beam. I am trying to read many zip files from a GCS bucket, unzip them, and save the contents back to GCS as CSV files. This is my pipeline:
    import apache_beam as beam
    import apache_beam.io.fileio  # makes sure beam.io.fileio is loaded

    with beam.Pipeline() as pipeline:
        readable_files = (
            pipeline
            | beam.io.fileio.MatchFiles('path/file/patter*.zip')
            | beam.io.fileio.ReadMatches()
            | beam.FlatMap(unzip)
            | beam.combiners.ToList())
        files_and_contents = (
            readable_files
            | beam.io.WriteToText('new', file_name_suffix='.csv'))
And I am unzipping the files with this function:
    import zipfile

    def unzip(readable_file):
        print(readable_file)
        input_zip = zipfile.ZipFile(readable_file.open())
        # Yields one dict per archive: member name -> raw bytes of that member
        yield {name: input_zip.read(name) for name in input_zip.namelist()}
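To see what this function hands to the next step, here is a small local check that does not touch GCS or Beam. It is only a sketch: `make_zip` and `unzip_from_handle` are names made up for this illustration, the archive is built in memory, and the function takes an already opened file object instead of a Beam readable file.

```python
import io
import zipfile


def make_zip(members):
    """Build an in-memory zip archive from a {name: text} mapping (test data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in members.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return buffer


def unzip_from_handle(file_handle):
    """Same logic as unzip, but taking an already opened file object (assumption)."""
    input_zip = zipfile.ZipFile(file_handle)
    yield {name: input_zip.read(name) for name in input_zip.namelist()}


sample = make_zip({"a.csv": "id,value\n1,x\n2,y\n"})
elements = list(unzip_from_handle(sample))
print(len(elements))  # number of elements produced for one archive
print(elements[0])    # a dict mapping the member name to its raw bytes
```

So each archive becomes a single element, a dict of undecoded bytes, and the line breaks of the CSV stay inside that one value. As far as I can tell, `ToList` then gathers all of these dicts into one list, so the text sink probably receives just that one element to write.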
I have tested it with only two files, and in the output every line was written as a column: the header is one column, and each of the other lines is a column as well.
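One direction that might help is to yield one decoded text line at a time instead of a whole dict per archive. The following is a rough sketch under several assumptions, not a tested Beam solution: `iter_csv_lines` is a hypothetical helper, the members are assumed to be UTF-8 text, and the demo uses an in-memory archive instead of GCS.

```python
import io
import zipfile


def iter_csv_lines(file_handle, encoding="utf-8"):
    """Yield every text line of every member in a zip archive, one at a time."""
    with zipfile.ZipFile(file_handle) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                # Decode the member's bytes lazily and strip the line endings
                for line in io.TextIOWrapper(member, encoding=encoding):
                    yield line.rstrip("\r\n")


# Local demo with an in-memory archive (test data only)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "id,value\n1,x\n2,y\n")
buffer.seek(0)

lines = list(iter_csv_lines(buffer))
print(lines)  # ['id,value', '1,x', '2,y']
```

In the pipeline this could presumably be plugged in as `beam.FlatMap(lambda f: iter_csv_lines(f.open()))`, with the `beam.combiners.ToList()` step dropped so that every line reaches `WriteToText` as its own element. Note that this would mix the lines, including the header rows, of all matched archives into the same output.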