Why does Spark Analyzer report "Correlated scalar subqueries must be Aggregated" in Case?

Question

A test that uses spark-sql is failing due to the following error:

Correlated scalar subqueries must be Aggregated

The relevant part of the SQL query that is failing:

COALESCE(
    CASE WHEN ndc_code IN 
    (SELECT CODE FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc')
      THEN '180'
    END,
    CASE WHEN hcpcs_code IN 
    (SELECT CODE FROM BOC_MED_METADATA WHERE CODE_TYPE = 'hcpcs')
      THEN '180'
    END,
    med_order_end_dttm,
    CASE WHEN days_supply IS NOT NULL
      THEN
        CASE
          WHEN discontinued AND (med_order_recorded_dttm BETWEEN medication_start AND start_plus)
            THEN med_order_recorded_dttm
          WHEN discontinued AND (med_order_recorded_dttm > start_plus)
            THEN start_plus
          WHEN NOT discontinued
            THEN start_plus
        END
    END,
    medication_start
  )

The first two cases in the coalesce are what I added and caused the test to fail. I believe it is because of how the subqueries will return multiple rows but I'm not sure how to get around that. Any help is appreciated. If it makes a difference, this SQL is only run in spark for the tests, it is actually run on a snowflake warehouse in production.

Does the answer for https://stackoverflow.com/questions/40357613/getting-error-correlated-scalar-sub-queries-must-be-aggregated-global-limit-1 help? — kc2018, Apr 24 '18 at 16:35

score 1 · Answer 1 · answered May 28 '18 at 20:44

I'm confused to see the error as I don't see how the subqueries could've been considered correlated. What are they correlated on?

Anyway, to get rid of the error (which I think happens at analysis time which is before the query gets executed) is to use any standard aggregate function that assure Spark Analyzer that the queries do return just a single row with a single column or no row at all (that would give null as the result).

You could use first or max standard aggregate functions as follows:

SELECT first(CODE) FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc'

I don't think Spark should report the exception since I can't find any correlated queries and using SQL IN should really accept any number of values from a subquery. I am confused.

score 0 · Answer 2 · answered Apr 24 '18 at 16:13

If this code:

CASE WHEN ndc_code IN 
              (SELECT CODE FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc')
     THEN '180'

Is generating an error of the form

Correlated scalar subqueries must be Aggregated

And the error message is accurate, then either CODE or CODETYPE is missing from the table. So, my first recommendation is to qualify all column names:

CASE WHEN ndc_code IN 
              (SELECT bmm.CODE FROM BOC_MED_METADATA bmm WHERE bmm.CODE_TYPE = 'ndc')
     THEN '180'

This should return an error that one or the other columns are not found.

I qualified the columns and am still getting the same error. Here is what the stack trace looks like: `Filter (CODE#479 = outer(ndc_code#532)) +- SubqueryAlias boc_med_metadata +- Relation[id#478L,code#479,code_type#480,active_window#481] parquet` — AJwr, Apr 24 '18 at 16:47

kc2018 · Answer 3 · 2018-04-24T16:52:50.147

Instead of using IN, use EXISTS in the first two CASE statements. Try this:

COALESCE(
    CASE WHEN EXISTS( SELECT 1 FROM BOC_MED_METADATA WHERE CODE_TYPE = 'ndc' AND CODE = ndc_code) 
      THEN '180'
    END,
    CASE WHEN EXISTS( SELECT 1 FROM BOC_MED_METADATA WHERE CODE_TYPE = 'hcpcs' AND CODE = hcpcs_code)
      THEN '180'
    END,
    med_order_end_dttm,
    CASE WHEN days_supply IS NOT NULL
      THEN
        CASE
          WHEN discontinued AND (med_order_recorded_dttm BETWEEN medication_start AND start_plus)
            THEN med_order_recorded_dttm
          WHEN discontinued AND (med_order_recorded_dttm > start_plus)
            THEN start_plus
          WHEN NOT discontinued
            THEN start_plus
        END
    END,
    medication_start
  )

If the above fails, try qualifying ndc_code and hcpcs_code with the table name.

Why does Spark Analyzer report "Correlated scalar subqueries must be Aggregated" in Case?

3 Answers3