I am still new to Apache Beam/Cloud Dataflow, so I apologize if my understanding is not correct.
I am trying to read a data file, ~30,000 rows long, through a pipeline. My simple pipeline first opened the CSV from GCS, pulled the headers out of the data, ran the data through a ParDo/DoFn function, and then wrote all of the output to a CSV back in GCS. This pipeline worked and was my first test.
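Roughly, that first working pipeline looked like the sketch below (header handling omitted). The gs:// paths, step labels, and the pass-through DoFn body are placeholders rather than my exact code:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class BuildRows(beam.DoFn):
    def process(self, element):
        # Pass-through placeholder for the real row-building logic.
        yield element

with beam.Pipeline(options=PipelineOptions()) as p:
    lines = p | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/input.csv')
    rows = lines | 'BuildRows' >> beam.ParDo(BuildRows())
    rows | 'WriteCSV' >> beam.io.WriteToText('gs://my-bucket/output',
                                             file_name_suffix='.csv')
```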
I then edited the pipeline to read the CSV, pull out the headers, remove the headers from the data, run the data through the ParDo/DoFn function with the headers as a side input, and then write all of the output to a CSV. The only new code was passing the headers in as a side input and filtering them out of the data.
The ParDo/DoFn function build_rows just yields the context.element so that I could confirm my side inputs were working.
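A minimal sketch of that second version is below. The gs:// paths, column names, and the is_header predicate are placeholders, and the DoFn is written with the current element-style process signature (my actual code is on the older context-style signature and yields context.element), but the shape is the same:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.pvalue import AsList

class BuildRows(beam.DoFn):
    def process(self, element, headers):
        # For now just yield the element unchanged, to confirm that the
        # headers side input is available alongside every data row.
        yield element

def is_header(line):
    # Placeholder predicate -- in my real code the header row is
    # identified from the file itself.
    return line.startswith('col_a,')

with beam.Pipeline(options=PipelineOptions()) as p:
    lines = p | 'ReadCSV' >> beam.io.ReadFromText('gs://my-bucket/input.csv')

    # Pull the header row out and drop it from the data.
    headers = lines | 'PullHeaders' >> beam.Filter(is_header)
    data = lines | 'DropHeaders' >> beam.Filter(lambda line: not is_header(line))

    rows = data | 'BuildRows' >> beam.ParDo(BuildRows(),
                                            headers=AsList(headers))
    rows | 'WriteCSV' >> beam.io.WriteToText('gs://my-bucket/output',
                                             file_name_suffix='.csv')
```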
The error I get is below:
I am not exactly sure what the issue is, but I think it may be due to a memory limit. When I trimmed my sample data from 30,000 rows down to 100 rows, my code finally worked.
The pipeline without the side inputs does read/write all 30,000 rows, but in the end I will need the side inputs to do transformations on my data.
How do I fix my pipeline so that I can process large CSV files from GCS and still use side inputs as a pseudo-global variable for the file?