I need to join two tables in PySpark, and the join should be done not on an exact value from the right table, but on the nearest value (since there is no exact match).
It works fine in regular SQL, but does not work in SparkSQL. I am using Spark 2.2.1.
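For illustration, with some hypothetical data (not my real tables):

tableA                tableB
field1  field2        field1  field2
A       10            A       12
A       25            A       23
B       7             B       5

for the tableA row (A, 10) I want to pick 12 from tableB (the closest field2 among rows with field1 = A), for (A, 25) I want 23, and for (B, 7) I want 5.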
In regular SQL:
SELECT a.*,
(SELECT b.field2 FROM tableB b
WHERE b.field1 = a.field1
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) as field2
FROM tableA a
ORDER BY a.field1
This works fine.
In SparkSQL:
...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")
query = "SELECT a.*, \
(SELECT b.field2 FROM tableB b \
WHERE b.field1 = a.field1 \
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) field2 \
FROM tableA a \
ORDER BY a.field1"
result_DF = spark.sql(query)
I get the following exception:
pyspark.sql.utils.AnalysisException: u'Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses
If Spark 2.2.1 does not support this, what would be the workaround?
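For what it's worth, the direction I have been considering is to replace the correlated subquery with a plain join on field1 followed by a window function that keeps only the closest match per tableA row. A rough sketch of that idea (the row_id column and the field2_nearest name are just mine, for illustration; I am not sure this is the right approach):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Tag every tableA row with a unique id so exactly one match per row is kept
a = tableA_DF.withColumn("row_id", F.monotonically_increasing_id())
b = tableB_DF.select("field1", F.col("field2").alias("b_field2"))

# Join on field1 only, then rank tableB candidates by distance to a.field2
w = Window.partitionBy("row_id").orderBy(F.abs(F.col("b_field2") - F.col("field2")))

result_DF = (a.join(b, on="field1", how="left")
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)   # keep only the closest candidate
              .withColumnRenamed("b_field2", "field2_nearest")
              .drop("rn", "row_id")
              .orderBy("field1"))

Ties would be broken arbitrarily, which I believe matches the LIMIT 1 behaviour, and with the left join a tableA row without any match in tableB would keep a null, like the subquery. Is this the right way to do it, or is there something better?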
Thank you in advance, Gary