
Problem:

I am writing an Apache Beam pipeline to convert Avro files to Parquet files (with the Spark runner). Everything works well until I start to convert large Avro files (15 GB).

The code used to read the Avro files into a PCollection:

        // Match all Avro files under the S3 prefix, then read each matched file as GenericRecords
        PCollection<GenericRecord> records =
                p.apply(FileIO.match().filepattern(s3BucketUrl + inputFilePattern))
                        .apply(FileIO.readMatches())
                        .apply(AvroIO.readFilesGenericRecords(inputSchema));
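
The records are then written back out as Parquet. The write step is not shown above; a minimal sketch of it, assuming Beam's ParquetIO and a placeholder outputPath (not taken from the actual pipeline), would look roughly like this:

        // Sketch only: outputPath and the write options (sharding, compression) are placeholders.
        records.apply(
                FileIO.<GenericRecord>write()
                        .via(ParquetIO.sink(inputSchema)) // reuse the same Avro schema for the Parquet sink
                        .to(outputPath)
                        .withSuffix(".parquet"));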

The error message from my entrypoint shell script is:

b'/app/entrypoint.sh: line 42: 8 Killed java -XX:MaxRAM=${MAX_RAM} -XX:MaxRAMFraction=1 -cp /usr/share/tink-analytics-avro-to-parquet/avro-to-parquet-deploy-task.jar

Hypothesis

After some investigation, I suspect that the AvroIO code above tries to load the whole Avro file as one partition, which causes the OOM issue.

One hypothesis I have is: if I can specify the number of partitions when reading the Avro file, say 100 partitions for example, then each partition will contain only about 150 MB of data, which should avoid the OOM issue.

My questions are:

  1. Does this hypothesis lead me in the right direction?
  2. If so, how can I specify the number of partitions while reading the Avro file?
fuyi
  • Hi - you can write the Avro files as partitions beforehand; the file pattern should then handle reading all files that match it. One way to test your assumption is to increase the machine's memory and see whether it still runs out of RAM. But ideally you want to be reading small files in parallel, so splitting the files makes sense – Claudiu S May 29 '20 at 10:48
  • Do you have a full stack trace of NPE? If yes, could you attach it? – Alexey Romanenko May 29 '20 at 12:03
  • Hi @ClaudiuS. The Avro files are partitioned so that each file is about 100 MB. There is no problem reading and processing one file; the problem happens when I load all the files, which is why I suspect AvroIO tries to load all the data in one shot instead of by partitions. – fuyi May 29 '20 at 18:28
  • @fuyi are you reading one huge 15 GB file? Then partitioning won't work here, as it's a single executor that needs the full 15 GB of RAM, or RAM close to 8 GB plus an ample amount of EBS volume – Nagaraj Tantri Jun 10 '20 at 01:40
  • Hi @NagarajTantri Thanks for your comment. Please see the comment above; the input is a bunch of files of about 100 MB each. – fuyi Jun 11 '20 at 08:59

1 Answer


Instead of setting the number of partitions, the Spark session has a property called spark.sql.files.maxPartitionBytes, which is set to 128 MB by default (see the Spark SQL configuration reference).

Spark uses this number to partition the input file(s) while reading them into memory.
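
As an illustration (not from the answer's actual setup), the property can be lowered so that each input split stays well under the executor heap. With a directly built SparkSession it would look roughly like the sketch below; when the Beam pipeline is launched through spark-submit, the same value can be passed with --conf spark.sql.files.maxPartitionBytes=<bytes>:

        // Sketch only: a Beam SparkRunner job is usually configured via spark-submit --conf,
        // but the property itself is a plain Spark setting.
        SparkSession spark = SparkSession.builder()
                .appName("avro-to-parquet")
                // split input files into read partitions of at most 64 MB
                .config("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
                .getOrCreate();

Smaller values mean more, smaller partitions, which trades some scheduling overhead for a lower peak memory footprint per task.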

I tested with a 50 GB Avro file, and Spark partitioned it into 403 partitions. This Avro to Parquet conversion worked on a Spark cluster with 16 GB of memory and 4 cores.

fuyi