
I use Spark 1.6.2.

I need to find the maximum count per group.

val myData = Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "12"),("aa3", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_B", "12"))

val df = sc.parallelize(myData).toDF("id","type","activity")

Let's first calculate the number of observations per group:

df.groupBy("type","id").count.show

+-------+---+-----+
|   type| id|count|
+-------+---+-----+
|GROUP_A|aa1|    2|
|GROUP_A|aa2|    1|
|GROUP_B|aa2|    1|
|GROUP_B|aa3|    3|
+-------+---+-----+

This is the expected result:

+-------+---+-----+
|   type| id|count|
+-------+---+-----+
|GROUP_A|aa1|    2|
|GROUP_B|aa3|    3|
+-------+---+-----+

I tried this, but it does not work:

df.groupBy("type","id").count.filter("count = 'max'").show
  • your "expected result" doesn't correlate to what you seem to be describing: if you want the "maximum count per each group", you should get 12 for group `A` and 14 for group `B`, shouldn't you? And which `id` do you expect to see - the one matching the record with the maximum `count`? Please clarify. – Tzach Zohar May 11 '17 at 16:07
  • @TzachZohar: No, I am not searching for the maximum value. I want to first count the number of observations per group and then select the maximum `count` per group. Please see my updated question; I explained it step by step. To avoid any confusion I renamed the initial columns using `.toDF("id","type","activity")` – Dinosaurius May 11 '17 at 16:10

3 Answers


To get the "row with maximum value of column X" (and not just that maximum value), you can use this little trick of "grouping" the relevant columns together into a struct that contains the ordering column as the first column - and then computing the max for that struct. Since the ordering of struct is "dominated" by the ordering of its first column - we'll get the desired result:

df.groupBy("id","type").count()                // get count per id and type
  .groupBy("type")                             // now group by type only
  .agg(max(struct("count", "id")) as "struct") // get maximum of (count, id) structs - since count is first, and id is unique - count will decide the ordering
  .select($"type", $"struct.id" as "id", $"struct.count" as "count") // "unwrap" structs
  .show()

// +-------+---+-----+
// |   type| id|count|
// +-------+---+-----+
// |GROUP_A|aa1|    2|
// |GROUP_B|aa3|    3|
// +-------+---+-----+
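If you want to inspect the (count, id) structs that the max is computed over, a quick, purely illustrative check is to materialize them in a column before aggregating (the column name cnt_id is just for illustration):

df.groupBy("id", "type").count()
  .withColumn("cnt_id", struct("count", "id")) // the pairs that max() compares, count first
  .show(false)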
Tzach Zohar

You can use the max function after the groupBy, then join back to get the matching rows.

val myData = Seq(("aa1", "GROUP_A", "10"),("aa1","GROUP_A", "12"),("aa2","GROUP_A", "12"),("aa3", "GROUP_B", "14"),("aa3","GROUP_B", "11"),("aa3","GROUP_B","12" ),("aa2", "GROUP_B", "12"))

val df = sc.parallelize(myData).toDF("id","type","activity")

import org.apache.spark.sql.functions._

// Count per (type, id), aliasing the count as cnt; then find the maximum cnt per type.

val newDF = df.groupBy("type", "id").agg(count("*").alias("cnt"))

val df1 = newDF.groupBy("type").max("cnt")
df1.show

Now you can join these two DataFrames to get your output (joining on type as well, so each group's maximum is matched within its own group):

df1.as("df1").join(newDF.as("newDF"), col("df1.type") === col("newDF.type") && col("cnt") === col("max(cnt)"))
  .select($"newDF.*").show
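With the sample data above, the joined result should contain the same two rows as the expected output (row order may differ):

// +-------+---+---+
// |   type| id|cnt|
// +-------+---+---+
// |GROUP_A|aa1|  2|
// |GROUP_B|aa3|  3|
// +-------+---+---+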
koiralo

You can also use a Window function to find the max and then remove the duplicates, combining it with @Tzach's struct trick above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window.partitionBy(col("type"))

df.groupBy("type","id").count()
  .withColumn("count", max(struct("count", "id")).over(windowSpec)) // per-type max of (count, id) structs
  .dropDuplicates(Seq("type"))
  .select($"type", $"count.id" as "id", $"count.count" as "count")
  .show

Thanks

Ramesh Maharjan