
Using Scala and Spark 1.6.3, my error message is:

org.apache.spark.sql.AnalysisException: expression 'id' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

The code that generates the error is:

returnDf.withColumn("colName", max(col("otherCol")))

The DataFrame returnDf looks like:

+---+--------------------+
| id|            otherCol|
+---+--------------------+
|1.0|[0.0, 0.217764172...|
|2.0|          [0.0, 0.0]|
|3.0|[0.0, 0.142646382...|
|4.0|[0.63245553203367...|

There is a solution to this when using SQL syntax. What is an equivalent solution using the syntax I am using above (i.e. the withColumn() function)?

    So you are actually looking for a maximum value in an array, aren't you? If that's the case you cannot use `max` at all (not that it could be applied to an `array<>` column anyway). In 2.4 you can use higher order functions, but in 1.6 you'll have to use a `udf` like `udf((xs: Seq[Double]) => xs.max)`. – 10465355 Jan 10 '19 at 23:31
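
For reference, this is roughly what the 2.4+ route mentioned in the comment looks like; it is only a sketch, since array_max was added in Spark 2.4 and none of this is available on 1.6.3:

import org.apache.spark.sql.functions.{array_max, col}

// Spark 2.4+ only: returns the largest element of the array in each row.
returnDf.withColumn("colName", array_max(col("otherCol")))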

2 Answers


You need to do a groupBy before using aggregation functions:

returnDf.groupBy(col("id")).agg(max("otherCol"))
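
A minimal sketch of that pattern with the imports it assumes (the alias name is only illustrative; note this aggregates across rows per id, which is different from taking the max within each row's array):

import org.apache.spark.sql.functions.{col, max}

// One row per id, holding the largest otherCol value seen in that group.
val grouped = returnDf.groupBy(col("id")).agg(max("otherCol").alias("colName"))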
    To make it easier for others to read your answer, try to use code formatting for code snippets. See https://stackoverflow.com/editing-help#code for information. – Sander Apr 14 '20 at 19:25

The problem is that max is an aggregate function: it returns the max of a column, not the max of the array in each row of that column.

To get the max of an array, the correct solution is to use a UDF:

returnDf.withColumn("colName", udf((v : Seq[Double]) => v.max).apply(col("otherCol")))
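
Put together, a sketch of how that might look on 1.6 (the val names are only illustrative; the column names come from the question):

import org.apache.spark.sql.functions.{col, udf}

// UDF that returns the largest element of the array in each row.
val arrayMax = udf((v: Seq[Double]) => v.max)

val withMax = returnDf.withColumn("colName", arrayMax(col("otherCol")))
withMax.show()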