
I have a dataset like this:

a = sc.parallelize([[1,2,3],[0,2,1],[9,8,7]]).toDF(["one", "two", "three"])

I want a dataset with an added column that is equal to the largest value in the other three columns. The output would look like this:

+----+----+-----+-------+
|one |two |three|max_col|
+----+----+-----+-------+
|   1|   2|    3|      3|
|   0|   2|    1|      2|
|   9|   8|    7|      9|
+----+----+-----+-------+

I thought I would use withColumn, like so:

b = a.withColumn("max_col", max(a["one"], a["two"], a["three"]))

but this yields the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark152/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Odd. Does max return a bool? Not according to the documentation on max. Okay. Weird.

I find it odd that this works:

b = a.withColumn("max_col", a["one"] + a["two"] + a["three"]))

And the fact that it works makes me think even more strongly that max is behaving in some way I don't understand.

I also tried b = a.withColumn("max_col", max([a["one"], a["two"], a["three"]])), which passes the three columns in as a list rather than as three separate elements. This yields the same error as above.

– Katya Willard

3 Answers


Actually, what you need here is greatest, not max:

from pyspark.sql.functions import greatest

a.withColumn("max_col", greatest(a["one"], a["two"], a["three"]))

And just for completeness, you can use least to find the minimum:

from pyspark.sql.functions import least

a.withColumn("min_col", least(a["one"], a["two"], a["three"]))

Regarding the error you see, it is quite simple: Python's max depends on rich comparisons, and comparing two columns yields a Column, not a boolean:

from pyspark.sql.functions import col

type(col("a") < col("b"))
## pyspark.sql.column.Column

PySpark explicitly forbids converting columns to booleans (you can check the Column.__nonzero__ source) because it is simply meaningless: a Column is only a logical expression, which cannot be evaluated in the driver context.
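To make the mechanism concrete (a small illustration, not from the original answer): max picks a winner by coercing comparison results to booleans, and that coercion is exactly what raises here:

comparison = a["one"] < a["two"]  # a Column wrapping the expression, not a Python bool
bool(comparison)  # raises the ValueError shown in the question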

– zero323

If I understand it right, you are mixing up the max of a column and the max of a row. In fact, .withColumn needs to receive a Column; what you need is a row operation.

# Row objects are iterable, so max(row) takes the maximum across the row's values
b = a.map(lambda row: (row.one, row.two, row.three, max(row)))

b is then an RDD; you can convert it back to a DataFrame:

b.toDF('one','two','three','max')
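As an aside (not part of the original answer), this relies on Spark 1.x behaviour; in Spark 2.x the Python DataFrame no longer exposes map directly, so the same idea goes through the underlying RDD:

b = a.rdd.map(lambda row: (row.one, row.two, row.three, max(row)))
b.toDF(['one', 'two', 'three', 'max'])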
– Alexis Benichoux

You cannot use Python's built-in max here, as it doesn't return the expected pyspark.sql.Column. An example of a PySpark DataFrame function is array, which builds an array column from a few input columns; note the return type in the source:

http://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html#array
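For instance, a quick check of that return type:

from pyspark.sql.functions import array

combined = array("one", "two", "three")  # packs the three columns into one array column
type(combined)
## pyspark.sql.column.Column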

To achieve what you need, you could write a user-defined function like this (UNTESTED):

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

def my_max(*cols):
    return max(cols)

# The return type must be an instance, IntegerType(), not the class itself
udf_my_max = udf(my_max, IntegerType())

# Unpack the column names into separate arguments rather than passing a list
a.withColumn('max_col', udf_my_max(*a.columns))
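For what it's worth, greatest from the accepted answer stays entirely on the JVM side, while a Python UDF serializes every row through Python, so the built-in is generally the better choice when it applies.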
– shuaiyuancn
    Unfortunately this didn't work for me. Sure it's a small issue / bug since you didn't get a chance to test. I would prefer to stay using DataFrames rather than RDDs, so if you find a working solution I would appreciate it! – Katya Willard Jun 15 '16 at 15:47