Spark version > 2. While trying to convert a large pandas DataFrame to a Spark DataFrame and write it to S3, I got the following error:
Serialized task 880:0 was 665971191 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
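The conversion and write look roughly like this (a minimal sketch; the DataFrame contents, the simplified session setup, and the S3 path are placeholders, not my real data):
import pandas as pd
from pyspark.sql import SparkSession

# Session creation simplified here; the actual builder call is shown further below.
spark = SparkSession.builder.getOrCreate()

# pdf stands in for the large pandas DataFrame (placeholder data; the real one is much bigger).
pdf = pd.DataFrame({"id": range(1_000_000), "value": [0.0] * 1_000_000})

# createDataFrame ships the local pandas data to the executors as serialized tasks,
# which, as far as I can tell, is where the oversized message comes from.
sdf = spark.createDataFrame(pdf)

# Bucket and prefix are placeholders.
sdf.write.mode("overwrite").parquet("s3a://my-bucket/output/")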
I tried repartitioning to increase the number of partitions, but it did not solve the problem.
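Roughly what the repartition attempt looked like (the partition count is just one of the values I experimented with):
# Same pipeline as above, but repartitioning before the write;
# 1000 is only an example value.
sdf = spark.createDataFrame(pdf).repartition(1000)
sdf.write.mode("overwrite").parquet("s3a://my-bucket/output/")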
I read through this question: Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
and tried the following:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .appName("myWork")
         .config("spark.rpc.message.maxSize", "1024mb")
         .getOrCreate())
I still got the same error. Any suggestions?
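For reference, this is how I check what value the running context actually reports for the key from the error message (a small sketch using SparkConf.get; the default string is a placeholder):
# Read back the value the running SparkContext reports for spark.rpc.message.maxSize;
# the second argument is returned if the key is not set.
print(spark.sparkContext.getConf().get("spark.rpc.message.maxSize", "<not set>"))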