I started seeing the following error after deploying some changes to a Spark SQL query in an AWS Glue Spark 2.2.1 environment:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 164 tasks (1031.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried disabling broadcast joins with set("spark.sql.autoBroadcastJoinThreshold", "-1") and increasing spark.driver.maxResultSize, which caused other errors.
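
Roughly, this is how I applied those settings (the 2g value is a placeholder, not the exact value from my job):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.driver.maxResultSize is read when the SparkContext starts,
// so set it on the SparkConf before getOrCreate().
val conf = new SparkConf()
  .set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable broadcast joins
  .set("spark.driver.maxResultSize", "2g")           // placeholder; default is 1g

val spark = SparkSession.builder().config(conf).getOrCreate()

The problem persisted until I replaced the following join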

X left outer join Y on array_contains(X.ids, Y.id)

with

val sqlDF = spark.sql("select * from X lateral view explode(ids) t as id")
sqlDF.createOrReplaceTempView("X_exploded")
...
X_exploded left outer join Y on X_exploded.id = Y.id
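
Putting it together, here is a self-contained sketch of the two variants (it assumes X and Y are already registered as temp views; the s3 paths in the comments are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// X and Y are assumed to be registered earlier, e.g.:
//   spark.read.json("s3://bucket/x/").createOrReplaceTempView("X") // placeholder path
//   spark.read.json("s3://bucket/y/").createOrReplaceTempView("Y") // placeholder path

// Original variant: array_contains is a non-equi predicate, so Spark
// cannot use a shuffle-based equi-join here and typically falls back
// to a nested-loop join.
val original = spark.sql(
  "select * from X left outer join Y on array_contains(X.ids, Y.id)")

// Rewritten variant: explode ids into one row per element, then join
// on a plain equality predicate.
spark.sql("select * from X lateral view explode(ids) t as id")
  .createOrReplaceTempView("X_exploded")
val rewritten = spark.sql(
  "select * from X_exploded left outer join Y on X_exploded.id = Y.id")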

I am using the AWS Glue managed environment and don't have access to the query plan. However, I am curious: why would joining on array_contains cause more data to be brought to the driver than exploding and joining on an exact match?

Note that table X contains 350 KB of data in json/gzip format, and table Y contains about 50 GB in json/zip format.

Thanks!

alecswan

2 Answers


It appears that your earlier approach brings all the values from Y whenever the array_contains function returns true.

In your later approach, explode creates a new row for each element, eliminating any duplicates and ultimately reducing the number of rows returned.

Kaa
  • Why would the earlier approach bring *all* the values from Y if only *some* cause array_contains() to return true? Even if X.ids contained duplicates (which it doesn't), why would that cause a Y record with a matching id to be returned multiple times in the earlier version? I can see that happening in the later version, because there each duplicate X_exploded.id will produce a separate match with Y.id. – alecswan Jan 28 '19 at 17:20

You can pass --conf spark.driver.maxResultSize=4g on the command line to increase the maximum result size.
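
For example, with spark-submit (the class and jar names are placeholders):

spark-submit \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.MyJob \
  my-job.jar

In a managed environment like AWS Glue you cannot invoke spark-submit directly, so this property would have to be supplied through the job's configuration instead.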

Moustafa Mahmoud
  • I did try this, even though I think it would have just masked the problem :) The result was that the job failed with "Executor self-exiting due to : Driver xxx.xxx.xxx.xxx:40993 disassociated". There must be some other setting that I need to change for this to work, or it could be related to some idiosyncrasies of the AWS Glue environment. – alecswan Jan 28 '19 at 17:29
  • @alecswan Please check this answer, which will solve the overall problem: https://stackoverflow.com/a/29839102/2516356, but apply my answer as well. – Moustafa Mahmoud Jan 28 '19 at 18:55