I started seeing the following error after deploying some changes to a Spark SQL query in an AWS Glue Spark 2.2.1 environment:

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 164 tasks (1031.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried disabling broadcast joins with set("spark.sql.autoBroadcastJoinThreshold", "-1") and increasing spark.driver.maxResultSize, which caused other errors.
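
Roughly, this is how I applied those settings (the 2g value is a placeholder, not the exact value from my job):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.driver.maxResultSize is read when the SparkContext starts,
// so set it on the SparkConf before getOrCreate().
val conf = new SparkConf()
  .set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable broadcast joins
  .set("spark.driver.maxResultSize", "2g")           // placeholder; default is 1g

val spark = SparkSession.builder().config(conf).getOrCreate()

The problem persisted until I replaced the following join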

X left outer join Y on array_contains(X.ids, Y.id)

with

val sqlDF = spark.sql("select * from X lateral view explode(ids) t as id")
sqlDF.createOrReplaceTempView("X_exploded")
...
X_exploded left outer join Y on X_exploded.id = Y.id
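
Putting it together, here is a self-contained sketch of the two variants (it assumes X and Y are already registered as temp views; the s3 paths in the comments are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// X and Y are assumed to be registered earlier, e.g.:
//   spark.read.json("s3://bucket/x/").createOrReplaceTempView("X") // placeholder path
//   spark.read.json("s3://bucket/y/").createOrReplaceTempView("Y") // placeholder path

// Original variant: array_contains is a non-equi predicate, so Spark
// cannot use a shuffle-based equi-join here and typically falls back
// to a nested-loop join.
val original = spark.sql(
  "select * from X left outer join Y on array_contains(X.ids, Y.id)")

// Rewritten variant: explode ids into one row per element, then join
// on a plain equality predicate.
spark.sql("select * from X lateral view explode(ids) t as id")
  .createOrReplaceTempView("X_exploded")
val rewritten = spark.sql(
  "select * from X_exploded left outer join Y on X_exploded.id = Y.id")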

I am using the AWS Glue managed environment and don't have access to the query plan. However, I am curious: why would joining on array_contains cause more data to be brought to the driver than exploding and joining on an exact match?

Note that table X contains 350 KB of data in json/gzip format, and table Y contains about 50 GB in json/zip format.

Thanks!

alecswan

2 Answers


It appears that your earlier approach brings all the values from Y whenever the array_contains function returns true.

In your later approach, explode creates a new row for each element, eliminating any duplicates and ultimately reducing the number of rows returned.

Kaa
  • Why would the earlier approach bring *all* the values from Y if only *some* cause array_contains() to return true? Even if X.ids contained duplicates (which it doesn't), why would that cause a Y record with a matching id to be returned multiple times in the earlier version? I can see that happening in the later version, because there each duplicate X_exploded.id will produce a separate match with Y.id. – alecswan Jan 28 '19 at 17:20

You can pass --conf spark.driver.maxResultSize=4g on the command line to increase the maximum result size.
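
For example, with spark-submit (the class and jar names are placeholders):

spark-submit \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.MyJob \
  my-job.jar

In a managed environment like AWS Glue you cannot invoke spark-submit directly, so this property would have to be supplied through the job's configuration instead.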

Moustafa Mahmoud
  • I did try this, even though I think it would have just masked the problem :) The result was that the job failed with "Executor self-exiting due to : Driver xxx.xxx.xxx.xxx:40993 disassociated". There must be some other setting that I need to change for this to work, or it could be related to some idiosyncrasies of the AWS Glue environment. – alecswan Jan 28 '19 at 17:29
  • @alecswan Please check this answer, which will solve the overall problem: https://stackoverflow.com/a/29839102/2516356, but apply my answer as well. – Moustafa Mahmoud Jan 28 '19 at 18:55