
I am very new to programming and Apache Beam. I am trying to read many zip files from a GCS bucket, unzip them, and save the contents back to GCS as CSV files.

with beam.Pipeline() as pipeline:
    readable_files = (
        pipeline
        | beam.io.fileio.MatchFiles('path/file/patter*.zip')
        | beam.io.fileio.ReadMatches()
        | beam.FlatMap(unzip)
        | beam.combiners.ToList())
    files_and_contents = (
        readable_files
        | beam.io.WriteToText('new', file_name_suffix='.csv'))

And I am unzipping the files with this function:

def unzip(readable_file):
    print(readable_file)
    input_zip = zipfile.ZipFile(readable_file.open())
    yield {name: input_zip.read(name) for name in input_zip.namelist()}

I have tested it with only two files, and every line was written as a column: the header is one column, and each of the other lines is another column. Here is an example.

CSV file saved

Vadim Kotov

1 Answer


Inside beam.io.fileio.ReadMatches(), try adding skip_header_lines=1.

  • As it’s currently written, your answer is unclear and requires supporting information. Please edit to add additional details that will help others understand how this addresses the question asked. – JayPeerachai May 23 '22 at 16:51
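
One possible reading of the symptom in the question: unzip yields a single dict per archive, ToList then collapses everything into one element, and WriteToText writes one output line per element, so every CSV row ends up side by side on the same line. As far as I know, skip_header_lines is a parameter of beam.io.ReadFromText rather than of ReadMatches, so it would not change this. Below is a minimal sketch of an unzip step that emits one element per CSV line instead. It uses only the standard library so it can be run without Beam. The names unzip_lines and make_zip are hypothetical, and it assumes the archive members are UTF-8 text.

```python
import io
import zipfile


def unzip_lines(file_obj):
    """Yield (member_name, line) for every text line in a zip archive.

    file_obj is any seekable binary file-like object. In the pipeline from
    the question this would be readable_file.open() (an assumption).
    """
    with zipfile.ZipFile(file_obj) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as member:
                # Decode the member as text and emit one element per line.
                for line in io.TextIOWrapper(member, encoding="utf-8"):
                    yield name, line.rstrip("\r\n")


def make_zip(files):
    """Build an in-memory zip from {name: text}; only used for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    buf.seek(0)
    return buf


sample = make_zip({"a.csv": "id,name\n1,x\n2,y\n"})
rows = list(unzip_lines(sample))
for name, line in rows:
    print(name, line)
```

In a pipeline, a function like this could be wrapped as beam.FlatMap(lambda f: (line for _, line in unzip_lines(f.open()))) and piped straight into WriteToText, without the ToList step, so that each CSV row becomes its own output line. Note that this still merges the rows of all archives into the same set of output shards. Writing one output file per input file would need a different sink, which is not shown here.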