After reading some questions on StackOverflow, I have been using the code below to read CSV files in Beam.
Pipeline code:
with beam.Pipeline(options=pipeline_options) as p:
    parsed_csv = (p | 'Create from CSV' >> beam.Create([input_file]))
    flattened_file = (parsed_csv | 'Flatten the CSV' >> beam.FlatMap(get_csv_reader))
Method that reads the CSV: get_csv_reader()
import csv
import io

import apache_beam as beam

def get_csv_reader(readable_file):
    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)
    # Read the file as a CSV
    gcs_reader = csv.reader(io.TextIOWrapper(gcs_file))
    # Skip the header row
    next(gcs_reader)
    return gcs_reader
I am using this approach instead of ReadFromText because ReadFromText fails when field values contain newline characters.
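For example, here is a minimal standalone sketch (the sample data is made up, not from my actual files) of the kind of record that a line-based read splits apart but that csv.reader handles:

import csv
import io

# A single record whose quoted field contains a newline
raw = 'id,address\n1,"123 Main St\nApt 4"\n'

# Splitting on newlines (what a line-based reader effectively does)
# cuts the record in the middle of the quoted field
print(raw.splitlines())   # ['id,address', '1,"123 Main St', 'Apt 4"']

# csv.reader respects the quoting and yields the record intact
print(list(csv.reader(io.StringIO(raw))))
# [['id', 'address'], ['1', '123 Main St\nApt 4']]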
Question: Is this way of reading a CSV efficient? Would it fail for huge files? I ask because I am using csv.reader in my method, and I suspect it loads the whole file into memory, which would cause failures for huge files. Please correct my understanding if I am wrong.
Additionally, since this is a PTransform, will my method be serialized and run on different worker nodes? I am confused about how Beam runs this code behind the scenes.
If this is not efficient, please suggest an efficient way to read CSV files in Apache Beam.