
We have the Spark DataFrame below (shown in the screenshot).

We need to pair each ID with its Name using the Spark RDD script below:

```python
from pyspark.sql import Row

data = spark.read.csv("DATA Present in Screenshot")
final_data = spark.createDataFrame([("", "", "")], ["name", "ID", "Division"])
divisions = ['div1', 'div2', 'div3', 'div4', 'div5', 'div6']

# collect() pulls every distinct (ID, Name) pair onto the driver
for id_name in data.select('ID', 'Name').distinct().collect():
    a = id_name['ID']
    # blank out the name when it merely repeats the ID
    b = "" if id_name['Name'] == id_name['ID'] else id_name['Name']
    rdd = sc.parallelize(divisions)
    # bind a and b as defaults so each iteration captures its own values
    people = rdd.map(lambda x, a=a, b=b: Row(Division=x, ID=a, name=b))
    df_data = sqlContext.createDataFrame(people)
    final_data = final_data.union(df_data)
```

This script works fine on small datasets, but on a large dataset it fails with the error below.

message: "Total size of serialized results of 22527 tasks (1921.1 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)" 
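The immediate knob for this error is the `spark.driver.maxResultSize` setting, which must be configured before the session starts; a sketch (the `4g` value is only an example):

```python
from pyspark.sql import SparkSession

# Raise the cap on results serialized back to the driver.
spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "4g")
         .getOrCreate())
```

Raising the limit only buys headroom, though; the root cause is the `collect()` loop shipping every distinct pair to the driver.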

Is there any way to tackle this error by modifying the script?

Amol
    Does this answer your question? [Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)](https://stackoverflow.com/questions/47996396/total-size-of-serialized-results-of-16-tasks-1048-5-mb-is-bigger-than-spark-dr) – Kulasangar Jun 01 '20 at 08:23
  • @Kulasanagar, I have tried to add spark.conf.set("spark.driver.maxResultSize", "4g") in my script still getting error ::Futures timed out after [10 seconds] I think we have to modify script – Amol Jun 03 '20 at 04:24
  • Guys anyone has any suggestion – Amol Jun 06 '20 at 09:03

0 Answers