PySpark SQL: more than one row returned by a subquery used as an expression:

Question

I a trying to create a new column in my test dataframe using values from another dataframe called train. Below is a snapshot. In the train dataframe, The value in the first row under the column aml_freq_a is the number of times v appears in column a. Similarly, the 42 under aml_freq_b is the number of times l appears in b. The ['aml_freq_a', 'aml_freq_b', 'aml_freq_c'] are basically frequency columns.

>>> train.show(5)

+---+---+---+----------+----------+----------+                                  
|  a|  b|  c|aml_freq_a|aml_freq_b|aml_freq_c|
+---+---+---+----------+----------+----------+
|  v|  l|  l|        56|        42|        29|
|  u|  g|  l|        47|        46|        29|
|  s|  g|  l|        28|        46|        29|
|  v|  m|  l|        56|        33|        29|
|  h|  m|  l|        44|        33|        29|
+---+---+---+----------+----------+----------+

In the test dataset has columns ['a', 'b', 'c']. Here, I need to add the frequency columns - ['aml_freq_a', 'aml_freq_b', 'aml_freq_c'].

>>> test.show(5)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  w|  j|  c|
|  a|  g|  w|
|  s|  d|  i|
|  g|  j|  r|
|  r|  b|  u|
+---+---+---+

To do this, I wrote subqueries that join train and test on a,b and,c.

query = "select test.*,
  (select aml_freq_a from test left join train on test.a = train.a),
  (select aml_freq_b from test left join train on test.b = train.b),
  (select aml_freq_c from ten left join train on test.c = train.c)
from test"

train.createTempView('train')
test.createTempView('test')

spark.sql(query) runs fine but when I call show() on it it returns me the following error

java.lang.RuntimeException: more than one row returned by a subquery used as an expression:

What does this mean? Initially I thought there was something wrong with my query but I validated my query here and there's nothing wrong here. What am I not seeing here?

Use dataframe join and do the joins one after the other. More than one joined row comes for each row from the test dataframe so it is impossible to create a dataframe like that. http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join — Michail N, May 21 '18 at 11:38
Also spark.sql(query) "runs" fine because it has not yet run. Spark is lazy evaluated which means that everything is computed only when an action is triggered (like show) See: https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage — Michail N, May 21 '18 at 11:44
@MichailN yes, yes. I am aware of the lazy eval. I always call the `show` method on all my dataframes. — Clock Slave, May 21 '18 at 11:48

score 1 · Answer 1 · answered May 21 '18 at 11:23

1

It means that

At least one of the correlated subqueries you use returns more than one match.
While Spark supports only one value returned for each row (i.e. correlated subqueries have to be aggregated).

answered May 21 '18 at 11:23

user9822761

11
1

Ramesh Maharjan · Accepted Answer · 2018-05-21T12:47:51.410

In your attempt, (select aml_freq_a from test left join train on test.a = train.a) would return a dataframe i.e. multiple rows which can't be used as a select argument as "select test.*, (select aml_freq_a from test left join train on test.a = train.a),...

Correct query would be as below

query = "select test.* from " \
            "(select test.*, aml_freq_a from " \
                "(select test.*, aml_freq_b from " \
                    "(select test.*, aml_freq_c from test " \
                "left join train on test.c = train.c) as test " \
            "left join train on test.b = train.b)  as test " \
        "left join train on test.a = train.a) as test"

If the header is required in following format

+---+---+---+----------+----------+----------+
|a  |b  |c  |aml_freq_a|aml_freq_b|aml_freq_c|
+---+---+---+----------+----------+----------+

then

query = "select test.* from " \
            "(select test.*, aml_freq_c from " \
                "(select test.*, aml_freq_b from " \
                    "(select test.*, aml_freq_a from test " \
                "left join train on test.a = train.a) as test " \
            "left join train on test.b = train.b)  as test " \
        "left join train on test.c = train.c) as test"

You can do it in much simpler and safer way using dataframe api

test.join(train.select('a', 'aml_freq_a'), ['a'], 'left') \
    .join(train.select('b', 'aml_freq_b'), ['b'], 'left') \
    .join(train.select('c', 'aml_freq_c'), ['c'], 'left')

I hope the answer is helpful

I have to go with the sql one over here because this part goes into a function and using sql gives more flexibility in cases where I have more than three variables. Apart from the being easy to understand is there any other benefit of using the dataframe API? Also, you mention `simpler and safer way`. What do you mean by `safer` here? — Clock Slave, May 22 '18 at 07:11

PySpark SQL: more than one row returned by a subquery used as an expression:

2 Answers2