
I am reading a CSV with 600 records using Spark 2.4.2. The last 100 records contain large data. I am running into the following error:

ERROR Job aborted due to stage failure: 
Task 1 in stage 0.0 failed 4 times, most recent failure: 
Lost task 1.3 in stage 0.0 (TID 5, 10.244.5.133, executor 3): 
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 47094. 
To avoid this, increase spark.kryoserializer.buffer.max value.

I have increased spark.kryoserializer.buffer.max to 2g (the maximum allowed setting) and the Spark driver memory to 1g, and was able to process a few more records, but I still cannot process all of the records in the CSV.
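
For reference, a minimal sketch of where these settings go when building the session (the app name and master are placeholders; spark.driver.memory has to be set at launch time, e.g. spark-submit --driver-memory 1g, since it cannot be changed after the driver JVM starts):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the app name and local[*] master are placeholders.
val spark = SparkSession.builder()
  .appName("csv-skew-example")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.kryoserializer.buffer.max", "2g") // 2g is the maximum allowed
  .getOrCreate()
```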

I have tried paging through the 600 records, e.g. with 6 partitions I can process 100 records per partition, but since the last 100 records are huge, the buffer overflow still occurs.

In this case the last 100 records are large, but the large ones could just as well be the first 100, or the records between 300 and 400. Unless I sample the data beforehand to get an idea of the skew, I cannot optimize the processing approach.

Is there a reason why spark.kryoserializer.buffer.max is not allowed to go beyond 2g?

Maybe I can increase the number of partitions and decrease the number of records read per partition? Is it possible to use compression?

Appreciate any thoughts.


2 Answers


Kryo buffers are backed by byte arrays, and primitive arrays can only be up to 2GB in size.

Please refer to the link below for further details: https://github.com/apache/spark/commit/49d2ec63eccec8a3a78b15b583c36f84310fc6f0

Since you cannot optimize the processing approach ahead of time, increase the number of partitions.
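
For example, something along these lines (the path and partition count are placeholders; the right count depends on how big the skewed records are):

```scala
// Sketch: spread the rows over more, smaller partitions so a block of
// large records is not serialized through a single task's Kryo buffer.
// "records.csv" and the count of 60 are placeholders.
val df = spark.read
  .option("header", "true")
  .csv("records.csv")
  .repartition(60)

println(df.rdd.getNumPartitions) // verify the new partition count
```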


What do you have in those records that a single one blows the Kryo buffer? In general, leaving the partitions at the default of 200 should be a good starting point. Don't reduce it to 6.

It looks like a single record (line) blows the limit. There are a number of options for reading in the CSV data; you can try the csv reader options. If there is a single line that translates into a 2GB buffer overflow, I would think about parsing the file differently. The csv reader also ignores/skips some text in the file (no serialization) if you give it a schema. If you remove some of the huge columns from the schema, it may read in the data easily.
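
A sketch of that idea with hypothetical column names ("id" and "small_col" are the fields actually needed, "huge_col" is the oversized one); Spark 2.4 can prune CSV columns that are never selected, so the huge column need not be fully parsed:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema: only "id" and "small_col" are actually needed.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("small_col", StringType),
  StructField("huge_col", StringType)
))

val df = spark.read
  .schema(schema)              // explicit schema, no inference pass
  .option("header", "true")
  .csv("records.csv")
  .select("id", "small_col")   // unselected columns can be pruned by the
                               // CSV parser in Spark 2.4
```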
