
I use Spark 2.0.

I'd like to execute the following SQL query:

val sqlText = """
select
  f.ID as TID,
  f.BldgID as TBldgID,
  f.LeaseID as TLeaseID,
  f.Period as TPeriod,
  coalesce(
    (select
       f.ChargeAmt
     from
       Fact_CMCharges f
     where
       f.BldgID = Fact_CMCharges.BldgID
     limit 1),
     0) as TChargeAmt1,
  f.ChargeAmt as TChargeAmt2,
  l.EFFDATE as TBreakDate
from
  Fact_CMCharges f
join
  CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat and date_format(l.EFFDATE,'D')<>1 and f.Period=EFFDateInt(l.EFFDATE) 
where
  f.ActualProjected = 'Lease'
except(
  select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID=f2.ID) 
"""
val query = spark.sql(sqlText)
query.show()

It seems that the inner query in coalesce causes the following error:

pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n

What's wrong with the query?

  • @Gotch From what I have found (see http://stackoverflow.com/questions/33933118/does-sparksql-support-subquery), sub-queries are not that trivial to run in Spark SQL. Wrapping the query in quotes has sometimes worked for me in some situations, but I didn't get a concrete idea here either. – Kalyan Dec 26 '16 at 09:05

1 Answer


You have to make sure that your sub-query, by definition (and not by data), only returns a single row. Otherwise the Spark Analyzer complains while analyzing the SQL statement.

So when Catalyst can't guarantee, just by looking at the SQL statement (without looking at your data), that the sub-query only returns a single row, this exception is thrown.
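For example, here is a minimal, self-contained sketch (with a made-up view t and columns id and amt, not the tables from the question) showing that the decision is made at analysis time, not at runtime:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._

// A tiny table where the correlation key id is deliberately non-unique
Seq((1, 10.0), (1, 20.0), (2, 30.0)).toDF("id", "amt").createOrReplaceTempView("t")

// Rejected with "Correlated scalar subqueries must be Aggregated",
// even though LIMIT 1 would make the result single-row at runtime:
// spark.sql("select o.id, (select i.amt from t i where i.id = o.id limit 1) from t o")

// Accepted: first() is an aggregate, so the sub-query is provably single-row
spark.sql("select o.id, (select first(i.amt) from t i where i.id = o.id) as amt from t o").show()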

If you are sure that your sub-query only ever returns a single row, you can wrap its projection in one of the following standard aggregate functions, so the Analyzer is satisfied (see the sketch after this list):

  • first
  • avg
  • max
  • min
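Applied to the query from the question, the coalesce sub-query might be rewritten like this (a sketch only; first() replaces limit 1, and the inner alias is renamed to f2 so it no longer shadows the outer f):

coalesce(
  (select first(f2.ChargeAmt)
   from Fact_CMCharges f2
   where f2.BldgID = f.BldgID),
  0) as TChargeAmt1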
  • I think these function names have to be used in upper case (FIRST, AVG, etc.) – Ganesh Jadhav Nov 13 '19 at 10:41
  • In SQL, case doesn't matter; in Python they need to be lower case: https://gist.github.com/romeokienzler/bfebc1c84dd2846d5b36f1182608fb9e – Romeo Kienzler Nov 19 '19 at 08:47
  • Not always. Spark SQL can be configured for case sensitivity: https://stackoverflow.com/questions/42946104/enable-case-sensitivity-for-spark-sql-globally – Ganesh Jadhav Nov 19 '19 at 12:09