
I have a Spark (2.4) job that failed with an exception saying "org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 5252 tasks is bigger than spark.driver.maxResultSize".

Here's my code snippet; it involves a join between two DataFrames:

import org.apache.spark.sql.functions.col

val df_a = ...      // load from HDFS
val df_b = ...      // load from HDFS
val a_deduped = df_a.dropDuplicates("id")
val a_duplicates = df_a.exceptAll(a_deduped)
val duplicates = a_deduped.join(df_b, col("id") === col("history_id"), "left_outer")
  .where(col("history_id").isNotNull)
val df_c = a_deduped.union(duplicates)
df_c.count

The code that triggers this failure is `df_c.count`.

Just wondering, how does DataFrame `count` work? My understanding is that it sums the number of rows for every partition and returns a single integer to the driver, so the data transferred to the driver should be minimal. But why is the `spark.driver.maxResultSize` limit being hit? Any ideas?
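
For reference, here is that mental model as a runnable sketch (a conceptual equivalent only; Spark's actual `count` is compiled to a partial/final aggregate by the SQL engine):

// Conceptual sketch: each task counts the rows in its own partition
// and sends back one Long; the driver only has to sum those values.
val conceptualCount: Long = df_c.rdd
  .mapPartitions(iter => Iterator(iter.size.toLong))
  .reduce(_ + _)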

yingweiw
  • Your sample code snippet is needed, without which it's difficult to understand what you are doing there. – Ram Ghadiyaram Nov 13 '19 at 21:59
  • You might want to follow [this](https://stackoverflow.com/a/47999105/4540147) answer. Specifically, try to disable `autoBroadcastJoinThreshold` (see the sketch after this thread). – Gsquare Nov 14 '19 at 03:09
  • Also, the `union` between `a_deduped` and `duplicates` seems incorrect, since they have different schemas. Change the join type to `left_semi` and remove the `where` clause (see the sketch after this thread). – Gsquare Nov 14 '19 at 03:13
  • @Gsquare my joined data is much bigger than `autoBroadcastJoinThreshold`, so broadcast should not be enabled, but I will try disabling it to see whether it helps. Regarding the union, my code above is pseudocode, just meant to demonstrate the logic; thanks anyway. – yingweiw Nov 15 '19 at 04:07
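
A minimal sketch of the two suggestions from the comments above, assuming the pseudocode's column names and a `SparkSession` named `spark` (both assumptions, since the real code isn't shown):

import org.apache.spark.sql.functions.col

// Disable auto-broadcast joins entirely; a threshold of -1 turns the
// optimization off, so Spark falls back to a sort-merge join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// A left_semi join keeps only a_deduped's columns for rows that have
// a match in df_b, so the union sees a consistent schema and the
// where(col("history_id").isNotNull) filter is no longer needed.
val duplicates = a_deduped.join(df_b, col("id") === col("history_id"), "left_semi")
val df_c = a_deduped.union(duplicates)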

0 Answers