
I am trying to write a large PySpark dataframe to S3 but keep hitting the spark.rpc.message.maxSize error. I am reading a large amount of JSON data from S3 and creating a dataframe out of it before doing some basic data cleaning and pre-processing steps: dropping unwanted columns, decoding row values, etc. After these steps I want to write the cleaned dataframe back to S3 in CSV format, but that step is failing. The full error is as follows:

    Job aborted due to stage failure: Serialized task 746:0 was 308309077 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
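For context, the read-and-clean step looks roughly like the sketch below; the raw S3 path, column names, and decoding logic are placeholders rather than my exact code:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the raw JSON from S3 (placeholder path).
    raw = spark.read.json("s3a://datasciences-bucket/knowledge_layers/raw_clickstream/")

    # Basic cleaning: drop unwanted columns and decode row values
    # (column names here are illustrative only).
    final_processed = (
        raw.drop("unwanted_col_1", "unwanted_col_2")
           .withColumn("event_payload", F.unbase64(F.col("event_payload")).cast("string"))
    )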

Following the instructions given here, I tried to increase the maxSize when starting the cluster using the following code:

    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark import SparkContext

    # Set spark.rpc.message.maxSize (in MiB) on the conf, then build the context.
    configura = SparkConf().set('spark.rpc.message.maxSize', '1024')
    sc = SparkContext.getOrCreate(conf=configura)
    spark = SparkSession.builder.getOrCreate()

But despite this I get the same error. Moreover, specifying the new maxSize doesn't seem to have any effect: even after setting the value, the error still reports the max allowed spark.rpc.message.maxSize as 134217728 bytes (the 128 MiB default).
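For what it's worth, this is a quick way to check which value the running context actually picked up (a sanity-check sketch using the sc and spark objects created above):

    # Ask the active context which value it is actually using.
    # getConf() returns a copy of the SparkConf; get() takes an optional default.
    print(sc.getConf().get('spark.rpc.message.maxSize', 'not set'))
    print(spark.sparkContext.getConf().get('spark.rpc.message.maxSize', 'not set'))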

For reference, this is the code I am using to write the file:

    final_processed.coalesce(1).write.format('csv').mode('overwrite') \
        .save("s3a://datasciences-bucket/knowledge_layers/processed_clickstream/processed_clickstream.csv")
  • maybe try `spark = (SparkSession.builder.config('spark.rpc.message.maxSize', '1024').getOrCreate())` to see if it has some effect? – pltc Apr 12 '22 at 19:45
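For reference, a sketch of that suggestion as a standalone snippet, assuming it runs before any SparkContext has been created in the process (otherwise getOrCreate() just returns the existing session):

    from pyspark.sql import SparkSession

    # Set the RPC message limit (in MiB) on the builder before creating the session.
    # If a context already exists, getOrCreate() returns it and a static setting
    # like this one will not take effect.
    spark = (
        SparkSession.builder
        .config('spark.rpc.message.maxSize', '1024')
        .getOrCreate()
    )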
