I am still new to Apache Beam/Cloud Dataflow, so I apologize if my understanding is not correct.

I am trying to read a data file, ~30,000 rows long, through a pipeline. My simple pipeline first opened the csv from GCS, pulled the headers out of the data, ran the data through a ParDo/DoFn function, and then wrote all of the output to a csv back in GCS. This pipeline worked and was my first test.

I then edited the pipeline to read the csv, pull out the headers, remove the headers from the data, run the data through the ParDo/DoFn function with the headers as a side input, and then write all of the output into a csv. The only new code was passing the headers in as a side input and filtering the header row out of the data.

(The original post included screenshots of the pipeline code here.)
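
For reference, here is a minimal sketch of what a pipeline like the one described above could look like. This is not the code from the screenshots; the bucket paths, the column names, and the hard-coded header string are placeholders, and the DoFn just re-emits each element the way build_rows does.

import apache_beam as beam

class BuildRows(beam.DoFn):
  def process(self, element, headers):
    # 'headers' arrives as a side input; for now just re-emit the element.
    yield element

with beam.Pipeline() as p:
  lines = p | 'ReadCsv' >> beam.io.ReadFromText('gs://my-bucket/input.csv')

  # Pull the header row out and filter it from the data. The header text is
  # assumed to be known up front here; the original pipeline may have derived
  # it from the file instead.
  header_text = 'col_a,col_b,col_c'
  headers = lines | 'KeepHeader' >> beam.Filter(lambda line: line == header_text)
  data = lines | 'DropHeader' >> beam.Filter(lambda line: line != header_text)

  rows = data | 'BuildRows' >> beam.ParDo(
      BuildRows(), headers=beam.pvalue.AsSingleton(headers))

  rows | 'WriteCsv' >> beam.io.WriteToText('gs://my-bucket/output.csv')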

The ParDo/DoFn function build_rows just yields the context.element so that I could make sure my side inputs were working.

The error I get is an assertion failure (the full trace was attached as a screenshot).
I am not exactly sure what the issue is, but I think it may be due to a memory limit. When I trimmed my sample data down from 30,000 rows to 100 rows, my code finally worked.

The pipeline without the side inputs does read/write all 30,000 rows, but in the end I will need the side inputs to do transformations on my data.

How do I fix my pipeline so that I can process large csv files from GCS and still use side inputs as a pseudo-global variable for the file?

T.Okahara
  • Note: This is tested locally. I have been doing incremental tests as I add code. If it works locally, then I run it on Google Cloud Dataflow to make sure it also runs there. If it works in Cloud Dataflow, then I add more code. – T.Okahara Feb 22 '17 at 18:58

1 Answer

I recently coded a CSV file source for Apache Beam, and I've added it to the beam_utils PyPI package. Specifically, you can use it as follows:

  1. Install beam utils: pip install beam_utils
  2. Import: from beam_utils.sources import CsvFileSource.
  3. Use it as a source: beam.io.Read(CsvFileSource(input_file)).

In its default behavior, the CsvFileSource returns dictionaries indexed by header - but you can take a look at the documentation to decide what option you'd like to use.
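
For example, a short usage sketch (the bucket path and the column name are placeholders, not part of the original answer):

import apache_beam as beam
from beam_utils.sources import CsvFileSource

with beam.Pipeline() as p:
  rows = p | 'ReadCsv' >> beam.io.Read(CsvFileSource('gs://my-bucket/input.csv'))
  # Each element is a dict keyed by the CSV header, e.g. row['col_a'].
  (rows
   | 'PickColumn' >> beam.Map(lambda row: row['col_a'])
   | 'Write' >> beam.io.WriteToText('gs://my-bucket/output.txt'))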

As an extra, if you want to implement your own custom CsvFileSource, you need to subclass Beam's FileBasedSource:

import csv
import apache_beam as beam

class CsvFileSource(beam.io.filebasedsource.FileBasedSource):
  def read_records(self, file_name, range_tracker):
    # Open the file (works for local and GCS paths) and parse it with the csv module.
    self._file = self.open_file(file_name)
    reader = csv.reader(self._file)
    for rec in reader:
      yield rec

And you can expand this logic to parse for headers and other special behavior.

Also, as a note, this source cannot be split because it needs to be parsed sequentially, so it may represent a bottleneck when processing data (though that may be okay).
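
Putting the last two points together, here is one possible sketch of such an extension; the DictCsvFileSource name and the dict output are illustrative and not part of beam_utils:

import csv
import apache_beam as beam

class DictCsvFileSource(beam.io.filebasedsource.FileBasedSource):
  def __init__(self, file_pattern):
    # The file has to be parsed sequentially, so disable splitting
    # (see also the comment below about splittable=False).
    super(DictCsvFileSource, self).__init__(file_pattern, splittable=False)

  def read_records(self, file_name, range_tracker):
    # Note: on Python 3 the returned file object may need to be wrapped in
    # io.TextIOWrapper before handing it to csv.reader.
    self._file = self.open_file(file_name)
    reader = csv.reader(self._file)
    headers = None
    for rec in reader:
      if headers is None:
        headers = rec  # treat the first row as the header
        continue
      yield dict(zip(headers, rec))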

Pablo
  • Hi Pablo, thanks for looking at another one of my questions. I've changed my code to use the beam_utils CsvFileSource that you wrote and things seem to be working much better. I no longer have to use side inputs, which were giving me trouble, but could you tell me what my problem might have been? Just so I can understand what was going on. – T.Okahara Feb 22 '17 at 23:18
  • Give me a little while to check why the assertion happened. – Pablo Feb 22 '17 at 23:30
  • You need to add an __init__ where you are explicit about whether it is splittable, i.e. super(CsvFileSource, self).__init__(filename, splittable=False). If not, you risk several workers reading the same contents again and again, because Beam assumes the range_tracker argument in read_records is respected. – innohead Jun 21 '17 at 14:38