Problem:
I am writing an Apache Beam pipeline to convert Avro files to Parquet files (with the Spark runner). Everything works well until I start to convert a large Avro file (15 GB).
The code used to read the Avro file and create the PCollection:
PCollection<GenericRecord> records =
    p.apply(FileIO.match().filepattern(s3BucketUrl + inputFilePattern))
     .apply(FileIO.readMatches())
     .apply(AvroIO.readFilesGenericRecords(inputSchema));
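For completeness, the records are then written out as Parquet roughly as follows (a sketch of the write step, not my exact code; outputPath and the suffix are illustrative):

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.parquet.ParquetIO;

// Write the GenericRecords as Parquet, reusing the Avro schema from the read.
records.apply(FileIO.<GenericRecord>write()
    .via(ParquetIO.sink(inputSchema))
    .to(outputPath)                // illustrative output location
    .withSuffix(".parquet"));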
The error message from my entrypoint shell script is:
b'/app/entrypoint.sh: line 42: 8 Killed java -XX:MaxRAM=${MAX_RAM} -XX:MaxRAMFraction=1 -cp /usr/share/tink-analytics-avro-to-parquet/avro-to-parquet-deploy-task.jar
Hypothesis:
After some investigation, I suspect that the AvroIO code above tries to load the whole Avro file as one partition, which causes the OOM issue.
One hypothesis I have is: if I can specify the number of partitions when reading the Avro file, say 100 partitions, then each partition would contain only about 150 MB of data, which should avoid the OOM issue.
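To make the hypothesis concrete, this is the kind of knob I am hoping for; the withNumPartitions() call below is purely hypothetical (I could not find anything like it in AvroIO) and is only there to illustrate the question:

// Hypothetical sketch only: withNumPartitions() does not exist in AvroIO,
// it just shows what I mean by "specifying the number of partitions".
PCollection<GenericRecord> records =
    p.apply(FileIO.match().filepattern(s3BucketUrl + inputFilePattern))
     .apply(FileIO.readMatches())
     .apply(AvroIO.readFilesGenericRecords(inputSchema)
         /* .withNumPartitions(100) */);   // ~150 MB per partition for a 15 GB file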
My questions are:
- Does this hypothesis lead me in the right direction?
- If so, how can I specify the number of partitions while reading the Avro file?
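For reference, the closest existing transform I am aware of is Reshuffle.viaRandomKey(), but as far as I understand it only redistributes elements after they have been read, so it probably does not help if reading the single 15 GB file is what runs out of memory. A sketch, assuming records is the PCollection from the read above:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

// Redistribute the already-read records across workers so that downstream
// steps are not pinned to a single partition. This happens after the read,
// so it likely does not address the read of the 15 GB file itself.
PCollection<GenericRecord> spread =
    records.apply(Reshuffle.<GenericRecord>viaRandomKey());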