I need to join two tables in PySpark, and the join should be done not on an exact value from the right table, but on the nearest value (since there is no exact match).
It works fine in regular SQL, but does not work in SparkSQL. I am using Spark 2.2.1.
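For illustration, with some hypothetical data (not my real tables):

tableA                tableB
field1  field2        field1  field2
A       10            A       12
A       25            A       23
B       7             B       5

for the tableA row (A, 10) I want to pick 12 from tableB (the closest field2 among rows with field1 = A), for (A, 25) I want 23, and for (B, 7) I want 5.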
In regular SQL:
SELECT a.*,
(SELECT b.field2 FROM tableB b
WHERE b.field1 = a.field1
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) as field2
FROM tableA a
ORDER BY a.field1
This works fine.
In SparkSQL:
...
tableA_DF.registerTempTable("tableA")
tableB_DF.registerTempTable("tableB")
query = "SELECT a.*, \
(SELECT b.field2 FROM tableB b \
WHERE b.field1 = a.field1 \
ORDER BY ABS(b.field2 - a.field2) LIMIT 1) field2 \
FROM tableA a \
ORDER BY a.field1"
result_DF = spark.sql(query)
I get the following exception:
pyspark.sql.utils.AnalysisException: u'Expressions referencing the outer query are not supported outside of WHERE/HAVING clauses
If Spark 2.2.1 does not support this, what would be the workaround?
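For what it's worth, the direction I have been considering is to replace the correlated subquery with a plain join on field1 followed by a window function that keeps only the closest match per tableA row. A rough sketch of that idea (the row_id column and the field2_nearest name are just mine, for illustration; I am not sure this is the right approach):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Tag every tableA row with a unique id so exactly one match per row is kept
a = tableA_DF.withColumn("row_id", F.monotonically_increasing_id())
b = tableB_DF.select("field1", F.col("field2").alias("b_field2"))

# Join on field1 only, then rank tableB candidates by distance to a.field2
w = Window.partitionBy("row_id").orderBy(F.abs(F.col("b_field2") - F.col("field2")))

result_DF = (a.join(b, on="field1", how="left")
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)   # keep only the closest candidate
              .withColumnRenamed("b_field2", "field2_nearest")
              .drop("rn", "row_id")
              .orderBy("field1"))

Ties would be broken arbitrarily, which I believe matches the LIMIT 1 behaviour, and with the left join a tableA row without any match in tableB would keep a null, like the subquery. Is this the right way to do it, or is there something better?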
Thank you in advance, Gary