0

I a trying to create a new column in my test dataframe using values from another dataframe called train. Below is a snapshot. In the train dataframe, The value in the first row under the column aml_freq_a is the number of times v appears in column a. Similarly, the 42 under aml_freq_b is the number of times l appears in b. The ['aml_freq_a', 'aml_freq_b', 'aml_freq_c'] are basically frequency columns.

>>> train.show(5)

+---+---+---+----------+----------+----------+                                  
|  a|  b|  c|aml_freq_a|aml_freq_b|aml_freq_c|
+---+---+---+----------+----------+----------+
|  v|  l|  l|        56|        42|        29|
|  u|  g|  l|        47|        46|        29|
|  s|  g|  l|        28|        46|        29|
|  v|  m|  l|        56|        33|        29|
|  h|  m|  l|        44|        33|        29|
+---+---+---+----------+----------+----------+

In the test dataset has columns ['a', 'b', 'c']. Here, I need to add the frequency columns - ['aml_freq_a', 'aml_freq_b', 'aml_freq_c'].

>>> test.show(5)
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  w|  j|  c|
|  a|  g|  w|
|  s|  d|  i|
|  g|  j|  r|
|  r|  b|  u|
+---+---+---+

To do this, I wrote subqueries that join train and test on a,b and,c.

query = "select test.*,
  (select aml_freq_a from test left join train on test.a = train.a),
  (select aml_freq_b from test left join train on test.b = train.b),
  (select aml_freq_c from ten left join train on test.c = train.c)
from test"

train.createTempView('train')
test.createTempView('test')

spark.sql(query) runs fine but when I call show() on it it returns me the following error

java.lang.RuntimeException: more than one row returned by a subquery used as an expression:

What does this mean? Initially I thought there was something wrong with my query but I validated my query here and there's nothing wrong here. What am I not seeing here?

Clock Slave
  • 7,627
  • 15
  • 68
  • 109
  • Can you edit your question to show us the test data frame? – Michail N May 21 '18 at 11:25
  • @MichailN, I have edited to include it. – Clock Slave May 21 '18 at 11:28
  • 1
    Use dataframe join and do the joins one after the other. More than one joined row comes for each row from the test dataframe so it is impossible to create a dataframe like that. http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join – Michail N May 21 '18 at 11:38
  • 1
    Also spark.sql(query) "runs" fine because it has not yet run. Spark is lazy evaluated which means that everything is computed only when an action is triggered (like show) See: https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage – Michail N May 21 '18 at 11:44
  • @MichailN yes, yes. I am aware of the lazy eval. I always call the `show` method on all my dataframes. – Clock Slave May 21 '18 at 11:48

2 Answers2

1

It means that

  • At least one of the correlated subqueries you use returns more than one match.
  • While Spark supports only one value returned for each row (i.e. correlated subqueries have to be aggregated).
1

In your attempt, (select aml_freq_a from test left join train on test.a = train.a) would return a dataframe i.e. multiple rows which can't be used as a select argument as "select test.*, (select aml_freq_a from test left join train on test.a = train.a),...

Correct query would be as below

query = "select test.* from " \
            "(select test.*, aml_freq_a from " \
                "(select test.*, aml_freq_b from " \
                    "(select test.*, aml_freq_c from test " \
                "left join train on test.c = train.c) as test " \
            "left join train on test.b = train.b)  as test " \
        "left join train on test.a = train.a) as test"

If the header is required in following format

+---+---+---+----------+----------+----------+
|a  |b  |c  |aml_freq_a|aml_freq_b|aml_freq_c|
+---+---+---+----------+----------+----------+

then

query = "select test.* from " \
            "(select test.*, aml_freq_c from " \
                "(select test.*, aml_freq_b from " \
                    "(select test.*, aml_freq_a from test " \
                "left join train on test.a = train.a) as test " \
            "left join train on test.b = train.b)  as test " \
        "left join train on test.c = train.c) as test"

You can do it in much simpler and safer way using dataframe api

test.join(train.select('a', 'aml_freq_a'), ['a'], 'left') \
    .join(train.select('b', 'aml_freq_b'), ['b'], 'left') \
    .join(train.select('c', 'aml_freq_c'), ['c'], 'left')

I hope the answer is helpful

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • This is helpful. Thanks. – Clock Slave May 22 '18 at 07:07
  • 1
    I have to go with the sql one over here because this part goes into a function and using sql gives more flexibility in cases where I have more than three variables. Apart from the being easy to understand is there any other benefit of using the dataframe API? Also, you mention `simpler and safer way`. What do you mean by `safer` here? – Clock Slave May 22 '18 at 07:11
  • I meant was type safety. :) – Ramesh Maharjan May 22 '18 at 07:23