We have the Spark DataFrame shown below and need to check each ID and its name using the following PySpark RDD script.
from pyspark.sql import Row

data = spark.read.csv("DATA Present in Screenshot")
final_data = spark.createDataFrame([("", "", "")], ["name", "ID", "Division"])

# pull every distinct ID/Name pair back to the driver
for id_name in data.select('ID', 'Name').distinct().collect():
    a = id_name['ID']
    # blank out the name when it is identical to the ID
    b = "" if id_name['Name'] == id_name['ID'] else id_name['Name']
    l = ['div1', 'div2', 'div3', 'div4', 'div5', 'div6']
    rdd = sc.parallelize(l)
    # bind the current a/b values as defaults so the lazy RDD does not
    # pick up values from a later loop iteration
    people = rdd.map(lambda x, a=a, b=b: Row(Division=x, ID=a, name=b))
    df_data = sqlContext.createDataFrame(people)
    # align column order before the positional union
    final_data = final_data.union(df_data.select('name', 'ID', 'Division'))
This script works fine on a small dataset, but on a large dataset it fails with the error below.
message: "Total size of serialized results of 22527 tasks (1921.1 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)"
Is there a way to tackle this error by modifying the script instead?
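From what I can tell, the collect() loop is what pushes every distinct ID/Name pair through the driver, so I was wondering whether a pure DataFrame version along these lines would be the right direction. This is only a rough sketch, assuming the goal is simply to attach the six divisions to every distinct ID/Name pair; crossJoin and when/otherwise are standard PySpark, the rest mirrors my script above.

from pyspark.sql import functions as F

# distinct ID/Name pairs, blanking the name when it equals the ID
# (same rule as the if/else in the loop above)
pairs = (data.select('ID', 'Name').distinct()
             .select(F.when(F.col('Name') == F.col('ID'), F.lit(''))
                      .otherwise(F.col('Name')).alias('name'),
                     'ID'))

# the six divisions as a tiny one-column DataFrame
divisions = spark.createDataFrame(
    [(d,) for d in ['div1', 'div2', 'div3', 'div4', 'div5', 'div6']],
    ['Division'])

# cartesian product stays on the executors, so nothing is collected to the driver
final_data = pairs.crossJoin(divisions).select('name', 'ID', 'Division')

Would something like this avoid the maxResultSize issue, or is there a better way to restructure the loop?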