We have the Spark DataFrame shown below and need to check each ID and its name using the following PySpark RDD script.
from pyspark.sql import Row

data = spark.read.csv("DATA Present in Screenshot")
final_data = spark.createDataFrame([("", "", "")], ["name", "ID", "Division"])

# pull every distinct ID/Name pair back to the driver
for id_name in data.select('ID', 'Name').distinct().collect():
    a = id_name['ID']
    # blank out the name when it is identical to the ID
    b = "" if id_name['Name'] == id_name['ID'] else id_name['Name']
    l = ['div1', 'div2', 'div3', 'div4', 'div5', 'div6']
    rdd = sc.parallelize(l)
    # bind the current a/b values as defaults so the lazy RDD does not
    # pick up values from a later loop iteration
    people = rdd.map(lambda x, a=a, b=b: Row(Division=x, ID=a, name=b))
    df_data = sqlContext.createDataFrame(people)
    # align column order before the positional union
    final_data = final_data.union(df_data.select('name', 'ID', 'Division'))
This script works fine on a small dataset, but on a large dataset it fails with the error below.
message: "Total size of serialized results of 22527 tasks (1921.1 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)"
Is there a way to tackle this error by modifying the script instead?
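From what I can tell, the collect() loop is what pushes every distinct ID/Name pair through the driver, so I was wondering whether a pure DataFrame version along these lines would be the right direction. This is only a rough sketch, assuming the goal is simply to attach the six divisions to every distinct ID/Name pair; crossJoin and when/otherwise are standard PySpark, the rest mirrors my script above.

from pyspark.sql import functions as F

# distinct ID/Name pairs, blanking the name when it equals the ID
# (same rule as the if/else in the loop above)
pairs = (data.select('ID', 'Name').distinct()
             .select(F.when(F.col('Name') == F.col('ID'), F.lit(''))
                      .otherwise(F.col('Name')).alias('name'),
                     'ID'))

# the six divisions as a tiny one-column DataFrame
divisions = spark.createDataFrame(
    [(d,) for d in ['div1', 'div2', 'div3', 'div4', 'div5', 'div6']],
    ['Division'])

# cartesian product stays on the executors, so nothing is collected to the driver
final_data = pairs.crossJoin(divisions).select('name', 'ID', 'Division')

Would something like this avoid the maxResultSize issue, or is there a better way to restructure the loop?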