I have a Spark (2.4) job that failed with the exception "org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 5252 tasks is bigger than spark.driver.maxResultSize".
Here is my code snippet; it involves a join of two DataFrames:
val df_a = ... //load from HDFS
val df_b = ... //load from HDFS
val a_deduped = df_a.dropDuplicates("id")
val a_duplicates = df_a.exceptAll(a_deduped)
val duplicates = a_deduped.join(df_b, col("id")===col("history_id"), "left_outer").where(col("history_id").isNotNull)
val df_c = a_deduped.union(duplicates)
df_c.count
The line that triggers the failure is df_c.count.
I'm wondering how DataFrame count works. My understanding is that it counts the rows in every partition and returns only an integer to the driver, so the data transferred to the driver should be minimal. Why, then, is the spark.driver.maxResultSize limit being hit? Any ideas?
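To make that mental model concrete, here is a rough sketch of how I picture count behaving. It uses plain Scala collections as stand-ins for partitions and is only my assumption, not Spark's actual implementation; the names CountMentalModel, countPartition and driverCount are made up for illustration.

```scala
// Mental model only: plain Scala collections stand in for Spark partitions.
// All names here are hypothetical and not part of the Spark API.
object CountMentalModel {
  // Each "task" scans one partition and produces a single Long.
  def countPartition[A](partition: Seq[A]): Long = partition.size.toLong

  // The "driver" receives one Long per partition and adds them up.
  def driverCount[A](partitions: Seq[Seq[A]]): Long =
    partitions.map(p => countPartition(p)).sum

  def main(args: Array[String]): Unit = {
    // Three partitions holding 3, 1 and 0 rows respectively.
    val partitions = Seq(Seq("a", "b", "c"), Seq("d"), Seq.empty[String])
    println(driverCount(partitions))
  }
}
```

If this picture were accurate, the driver would receive only a few bytes per task, which is why the error surprises me.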