
I'm trying to join a column value from one table, expl, onto my main table, cu, under an equality condition in Spark SQL. The catch is that many rows in expl match each row of cu, so I want to join only one random matching row and use its column value.

But I'm running into correlated-scalar-subquery errors, raised either by the subquery's SELECT list or by its ORDER BY. There don't seem to be any posts on SO that deal with the ORDER BY part of the subquery, or with retrieving a random row from a subquery in Spark.

In this first attempt I tried to order the subquery with a random number generator, which does not work; it seems like RAND() would somehow need to be wrapped in an aggregate.

cooccurrences = spark.sql("""
    SELECT C.word AS word,
           C.word2,
           (SELECT FIRST(e.word2)
            FROM expl e
            WHERE e.word = C.word
            ORDER BY RAND()) AS word3
    FROM cu C
""")

In this second attempt, I built a column on cu holding a random value per row, which can be used to order the rows, so a random number generator isn't needed inside the subquery; instead, the row with the maximum rand_n value is returned. But I dislike this because (a) it doesn't work, since the column can't be used in an aggregated ORDER BY, and (b) it could return the same row for duplicate groups.
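Roughly, the rand_n column was built like this (a sketch; cu_df stands in for whatever DataFrame backs the cu view):

from pyspark.sql.functions import rand

# Attach one uniform random value per row as a sort key,
# then re-register the temp view that the queries read from.
cu_df = cu_df.withColumn("rand_n", rand())
cu_df.createOrReplaceTempView("cu")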

cooccurrences = spark.sql("""
    SELECT e.word,
           e.word2,
           (SELECT z.word2
            FROM (SELECT FIRST(C.word2) AS word2, MAX(C.rand_n) AS rand_n
                  FROM cu C
                  WHERE e.word = C.word) z) AS word3
    FROM expl AS e
""")

These queries all throw a variant of "Correlated scalar subqueries must be aggregated".

Edit

I ended up using a UDF to solve this, which I'm not wild about.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def get_sample(word, word2):
  # ii is a pandas DataFrame of (word, word2) pairs, defined elsewhere.
  # Collect word2 values already paired with this word, plus the word itself.
  s = set(ii[ii['word'] == word]['word2'].unique()) | {word}
  # Sample one row at random from the remaining candidates.
  v = ii[~ii['word2'].isin(s)].sample(n=1)
  if v.empty: return ''
  return v.iloc[0]['word2']

sampler = udf(get_sample, StringType())
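
The UDF is then applied along these lines (a sketch; the actual call isn't shown above, and the column names are assumed from the queries):

from pyspark.sql.functions import col

# Hypothetical application: one sampled word3 per (word, word2) row.
cooccurrences = cooccurrences.withColumn("word3", sampler(col("word"), col("word2")))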

1 Answer


I want to only join one random row, and use its column value.

It looks like you just want a random sort, then limit 1:

select 
    e.word, 
    e.word2,
    (select c.word2 from cu c where c.word = e.word order by rand() limit 1) word3
from expl as e
GMB
  • This won't work in Spark. You have to use an aggregation function. See for example: https://stackoverflow.com/questions/40357613/what-does-correlated-scalar-subqueries-must-be-aggregated-mean – Union find Jul 31 '20 at 21:53
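
A common way around this restriction in Spark is to pick the random row with a window function rather than a correlated subquery: assign ROW_NUMBER() within each word partition ordered by a random value, keep rank 1, then do a plain join. A minimal sketch, assuming the same expl and cu temp views as in the question:

cooccurrences = spark.sql("""
    SELECT c.word, c.word2, s.word3
    FROM cu c
    LEFT JOIN (
        SELECT word, word2 AS word3
        FROM (
            SELECT word, word2,
                   ROW_NUMBER() OVER (PARTITION BY word ORDER BY r) AS rn
            FROM (SELECT word, word2, RAND() AS r FROM expl) x
        ) t
        WHERE t.rn = 1
    ) s
    ON s.word = c.word
""")

Because the window is computed over expl alone, there is no correlated subquery, so the must-be-aggregated restriction never comes into play.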