I am reading a CSV file with 600 records using Spark 2.4.2. The last 100 records contain large data. I am running into this error:
ERROR Job aborted due to stage failure:
Task 1 in stage 0.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 0.0 (TID 5, 10.244.5.133, executor 3):
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 47094.
To avoid this, increase spark.kryoserializer.buffer.max value.
I have increased spark.kryoserializer.buffer.max to 2g (the maximum allowed setting) and the Spark driver memory to 1g. That let me process a few more records, but I still cannot process all the records in the CSV.
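For reference, these settings can be passed via spark-submit roughly like this (the script name is a placeholder):

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=2g \
  --conf spark.driver.memory=1g \
  my_job.py
```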
I have also tried paging through the 600 records, e.g. with 6 partitions I can process 100 records per partition, but since the last 100 records are huge, the buffer overflow still occurs.
In this case the last 100 records are large, but the large records could just as well be the first 100, or fall anywhere, say between records 300 and 400. Unless I sample the data beforehand to get an idea of the skew, I cannot optimize the processing approach.
Is there a reason why spark.kryoserializer.buffer.max is not allowed to go beyond 2g?
Maybe I can increase the number of partitions and decrease the records read per partition? Is it possible to use compression?
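To illustrate the partitioning idea, here is a minimal PySpark sketch of what I have in mind (the file path, partition count, and per-partition function are placeholders; spark.rdd.compress compresses serialized partitions, so I am not sure it helps when a single record is oversized):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("skewed-csv")  # placeholder app name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "2g")
    # Compress serialized RDD partitions; codec options include lz4, snappy, zstd.
    .config("spark.rdd.compress", "true")
    .config("spark.io.compression.codec", "lz4")
    .getOrCreate()
)

# Read with more, smaller partitions so each task serializes less data at once.
df = spark.read.csv("records.csv", header=True)  # path is a placeholder
df = df.repartition(60)  # ~10 records per partition instead of 100

df.foreachPartition(process)  # process = existing per-partition logic, not shown
```

My concern is that repartitioning still cannot help if a single record's serialized size exceeds the buffer, and repartition() itself shuffles the data through the serializer.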
Appreciate any thoughts.