
After reading some questions on StackOverflow, I have been using the code below to read CSV files in Beam.

Pipeline code:

with beam.Pipeline(options=pipeline_options) as p:
    parsed_csv = (p | 'Create from CSV' >> beam.Create([input_file]))
    flattened_file = (parsed_csv | 'Flatten the CSV' >> beam.FlatMap(get_csv_reader))

Method to read the CSV: get_csv_reader()

def get_csv_reader(readable_file):
    # Open a channel to read the file from GCS
    gcs_file = beam.io.filesystems.FileSystems.open(readable_file)

    # Wrap the binary stream so csv.reader receives text
    gcs_reader = csv.reader(io.TextIOWrapper(gcs_file))

    # Skip the first row (the header)
    next(gcs_reader)

    return gcs_reader

I am using this instead of ReadFromText because ReadFromText fails when there are newline characters in the field values.

Question: My question is whether this way of reading a CSV is efficient. Would it fail for huge files? I ask because I am using csv.reader in my method, and I suspect it loads the whole file into memory, which would cause a failure for huge files. Please correct my understanding if I am wrong.

Additionally, since this is a PTransform, will my method be serialized and run on different worker nodes? I am confused about how Beam runs this code behind the scenes.

If this is not efficient, please suggest an efficient way to read CSV files in Apache Beam.

Akhil Kv
  • One optimisation could be to read line by line and create a PCollection of the lines, and then to have another transform that extracts the CSV data. That way you can parallelize the CSV data extraction, while the line-by-line read is very fast. If the file is huge, it is not required to fully load it in memory. – guillaume blaquiere Jun 02 '22 at 07:19
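
A minimal sketch of that two-stage idea (hypothetical bucket path; it assumes no embedded newlines in field values, so it does not by itself solve the problem that ruled out ReadFromText):

import csv

import apache_beam as beam

with beam.Pipeline() as p:
    rows = (
        p
        # Read the file line by line; each element is one text line.
        | 'Read lines' >> beam.io.ReadFromText('gs://my-bucket/data.csv', skip_header_lines=1)
        # Parse each line into its fields in a separate, parallelizable step.
        | 'Parse CSV line' >> beam.Map(lambda line: next(csv.reader([line])))
    )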

1 Answer


You can define a generator to lazily read the files row by row.

def read_csv_file(readable_file):
  # Open the file and yield rows one at a time instead of loading it all into memory.
  with beam.io.filesystems.FileSystems.open(readable_file) as gcs_file:
    # Wrap the binary stream so csv.reader receives text (as in the question's code).
    for row in csv.reader(io.TextIOWrapper(gcs_file)):
      yield row
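
Hooked into a pipeline like the one in the question (reusing its input_file and pipeline_options), FlatMap consumes the generator lazily and emits one element per CSV row:

with beam.Pipeline(options=pipeline_options) as p:
    rows = (
        p
        | 'Create from CSV' >> beam.Create([input_file])
        # Each row yielded by the generator becomes one PCollection element.
        | 'Read rows' >> beam.FlatMap(read_csv_file)
    )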

A similar question is How to handle newlines when loading a CSV into Apache Beam?

ningk
  • Thank you, I have some follow-ups: 1. Does this affect the performance? 2. In my code I am only returning the pointer to gcs_file back to the FlatMap, correct? So where is the file being read into memory? I have some trouble understanding what exactly is going on. – Akhil Kv Jun 02 '22 at 20:03
  • 1. The performance should be the same whether you read lazily in a lambda, a DoFn, or a PTransform. 2. In your code, what is returned is a lazy reader that you have to iterate through to get the rows (via next(), a for-loop, enumerate, ...). So your code returns a PCollection of lazy readers; the generator code in the answer returns a PCollection of text rows (waiting to be parsed into schema-ed data). – ningk Jun 02 '22 at 20:45
  • So since my code is returning a lazy reader, is it then as efficient as using a generator like in your answer? In both cases, the file is being read lazily. correct? Just trying to understand the difference. Also, does beam parallelize the lazy reading using the reader()? if so how? Sorry, I am asking too many questions. Thanks for your patience! – Akhil Kv Jun 02 '22 at 22:16
  • It should be as efficient. It's just that your code hasn't started reading the file contents into PCollections yet. Reading of each file can be executed on different workers (is parallelized) since the file names are elements of the input PCollection. But there is no magic to parallelize reading of the same file on multiple workers. – ningk Jun 02 '22 at 22:34
  • Got it. So if the file is read by a single node, doesn't it run into memory issues if it's huge? How and when does the file get distributed among the workers? In my code, after returning the lazy reader to FlatMap, I don't understand how Beam is able to read and transform the data into a PCollection without loading the file into memory. – Akhil Kv Jun 02 '22 at 23:59
  • So, I was testing this yesterday and this way of reading CSV seems to be very slow compared to the ReadFromText transform. Is that expected since the lines are read row by row? – Akhil Kv Jun 03 '22 at 16:26
  • Yes, it's expected because you are bottlenecked on the reader per file. If you have control over how those CSV files are generated, you can split them into more files to improve the parallelism when reading. – ningk Jun 03 '22 at 18:34
  • Is there any way to read faster? I tried reading in chunks, but each chunk seems to be read line by line too. Any other way to improve performance without splitting the file? – Akhil Kv Jun 03 '22 at 18:49
  • I'm trying beam 2.40.0, and it's not finding `beam.io.filesystems`. – dfrankow Jul 12 '22 at 13:29
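
Regarding the last comment: one possible workaround (an assumption, not verified against Beam 2.40.0) is to import FileSystems explicitly instead of relying on the beam.io.filesystems attribute:

from apache_beam.io.filesystems import FileSystems

def read_csv_file(readable_file):
  # Same generator as in the answer, using the explicitly imported FileSystems class.
  with FileSystems.open(readable_file) as gcs_file:
    for row in csv.reader(io.TextIOWrapper(gcs_file)):
      yield row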