
Using Scala and Spark 1.6.3, my error message is:

org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

The code that generates the error is:

returnDf.withColumn("colName", max(col("otherCol")))

The DataFrame returnDf looks like:

+---+--------------------+
| id|            otherCol|
+---+--------------------+
|1.0|[0.0, 0.217764172...|
|2.0|          [0.0, 0.0]|
|3.0|[0.0, 0.142646382...|
|4.0|[0.63245553203367...|

There is a solution to this when using SQL syntax. What is an equivalent solution using the syntax I am using above (i.e. the withColumn() function)?

    So you are actually looking for a maximum value in an array, aren't you? If that's the case you cannot use `max` at all (not that it could be applied to an `array<>` column anyway). In 2.4 you can use higher order functions, but in 1.6 you'll have to use a `udf` like `udf((xs: Seq[Double]) => xs.max)`. – 10465355 Jan 10 '19 at 23:31
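
For reference, this is roughly what the 2.4+ route mentioned in the comment looks like; it is only a sketch, since array_max was added in Spark 2.4 and none of this is available on 1.6.3:

import org.apache.spark.sql.functions.{array_max, col}

// Spark 2.4+ only: returns the largest element of the array in each row.
returnDf.withColumn("colName", array_max(col("otherCol")))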

2 Answers


You need to do a groupBy before using aggregation functions:

returnDf.groupBy(col("id")).agg(max("otherCol"))
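
A minimal sketch of that pattern with the imports it assumes (the alias name is only illustrative; note this aggregates across rows per id, which is different from taking the max within each row's array):

import org.apache.spark.sql.functions.{col, max}

// One row per id, holding the largest otherCol value seen in that group.
val grouped = returnDf.groupBy(col("id")).agg(max("otherCol").alias("colName"))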
    To make it easier for others to read your answer, try to use code formatting for code snippets. See https://stackoverflow.com/editing-help#code for information. – Sander Apr 14 '20 at 19:25

The problem is that max is an aggregate function: it returns the max of a column, not the max of the array in each row of that column.

To get the max of an array, the correct solution is to use a UDF:

returnDf.withColumn("colName", udf((v : Seq[Double]) => v.max).apply(col("otherCol")))
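
Put together, a sketch of how that might look on 1.6 (the val names are only illustrative; the column names come from the question):

import org.apache.spark.sql.functions.{col, udf}

// UDF that returns the largest element of the array in each row.
val arrayMax = udf((v: Seq[Double]) => v.max)

val withMax = returnDf.withColumn("colName", arrayMax(col("otherCol")))
withMax.show()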