
I have a SparkR DataFrame and I want to get the mode (most frequent) value for each unique name. How can I do this? There doesn't seem to be a built-in mode function. Either a SparkR or PySpark solution will do.

# Create DF
df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
DF <- createDataFrame(df)

name   | value
-----------------
Thomas |  5
Thomas |  5
Thomas |  4
Bill   |  3
Bill   |  3
Bill   |  9

# What I want to get
name   | mode(value)
-----------------
Thomas |   5
Bill   |   3 
Gaurav Bansal

3 Answers


You could achieve that using a combination of `groupBy()` and window functions, like this:

from pyspark.sql import Window
from pyspark.sql.functions import col, desc, row_number

# Count occurrences of each (name, value) pair
grouped = df.groupBy('name', 'value').count()
# Rank values within each name, most frequent first
window = Window.partitionBy('name').orderBy(desc('count'))
grouped\
    .withColumn('order', row_number().over(window))\
    .where(col('order') == 1)\
    .show()

outputs:

+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|Thomas|    5|    2|    1|
+------+-----+-----+-----+
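
To get exactly the shape asked for in the question (just each name and its mode), you could additionally drop the helper columns. A minimal sketch, reusing `grouped` and `window` from above:

grouped\
    .withColumn('order', row_number().over(window))\
    .where(col('order') == 1)\
    .select('name', 'value')\
    .show()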
Konrad Kostrzewa
  • Thanks, @pandaromeo. Can you explain what the `Window.partitionBy('name').orderBy(desc('count'))` does? I'm also having trouble converting this code to SparkR, though the `windowPartitionBy` command exists there. – Gaurav Bansal Jun 28 '17 at 19:06
  • After `groupBy` and `agg` you end up with a `count` column. Imagine that you perform `orderBy` on it in descending order. With a window you not only sort it but also give each row an index, which enables you to filter out just the ones that interest you; in this case, the mode (most frequent) values (see the sketch after these comments). Unfortunately I have no experience with R, so I can't translate it to SparkR. – Konrad Kostrzewa Jun 28 '17 at 19:24
  • Actually, this link explains it well: https://github.com/apache/spark/blob/master/R/pkg/vignettes/sparkr-vignettes.Rmd. – Gaurav Bansal Jun 28 '17 at 19:36
  • If you want to ignore null values do `grouped = df.where(df["value"].isNotNull()).groupBy("name", "value").count()` – asmaier Oct 06 '17 at 14:12
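
To visualize what the window comment above describes, here is a minimal sketch of the intermediate result before the `where(col('order') == 1)` filter, assuming the same `grouped` and `window` as in the answer. Each `(name, value)` pair gets a `row_number` index within its name partition, most frequent value first, so keeping only `order == 1` keeps the mode:

grouped.withColumn('order', row_number().over(window)).show()

which outputs (row order may vary):

+------+-----+-----+-----+
|  name|value|count|order|
+------+-----+-----+-----+
|  Bill|    3|    2|    1|
|  Bill|    9|    1|    2|
|Thomas|    5|    2|    1|
|Thomas|    4|    1|    2|
+------+-----+-----+-----+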

Spark 3.4+ has a `mode` column function.

PySpark full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Thomas', 5),
     ('Thomas', 5),
     ('Thomas', 4),
     ('Bill', 3),
     ('Bill', 3),
     ('Bill', 9)],
    ['name', 'value'])

df.groupBy('name').agg(F.mode('value')).show()
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|          5|
# |  Bill|          3|
# +------+-----------+
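
In Spark 3.4+, `mode` is also exposed as a SQL aggregate function (the SparkR example below relies on this via `expr`), so the same result can be obtained with plain SQL. A minimal sketch, reusing the `df` created above:

df.createOrReplaceTempView('df')
spark.sql('SELECT name, mode(value) FROM df GROUP BY name').show()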

SparkR full example:

df <- data.frame(name = c("Thomas", "Thomas", "Thomas", "Bill", "Bill", "Bill"),
                 value = c(5, 5, 4, 3, 3, 9))
df <- as.DataFrame(df)

df <- agg(groupBy(df, 'name'), expr("mode(value)"))

showDF(df)
# +------+-----------+
# |  name|mode(value)|
# +------+-----------+
# |Thomas|        5.0|
# |  Bill|        3.0|
# +------+-----------+
ZygD

Here's the SparkR version of the solution:

# Count occurrences of each (name, value) pair
grouped <- agg(groupBy(df, 'name', 'value'), count = count(df$value))
# Rank values within each name, most frequent first
window <- orderBy(windowPartitionBy("name"), desc(grouped$count))
dfmode <- withColumn(grouped, 'order', over(row_number(), window))
# Keep only the top-ranked row per name, i.e. the mode
dfmode <- filter(dfmode, dfmode$order == 1)
Gaurav Bansal