Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

Question

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.

For example,

Suppose we have the following dataframe:

Category   Color   Number   Letter      
1        Red         4        A
1        Yellow      Null     B
3        Green       8        C
2        Blue        Null     A
1        Green       9        A
3        Green       8        B
3        Yellow      Null     C
2        Blue        9        B
3        Blue        8        B
1        Blue        Null     Null
1        Red         7        C
2        Green       Null     C
1        Yellow      7        Null
3        Red         Null     B

Now we want to group by Category and then Color, and for each group find the size of the group, the count of non-null Number values, the mean of Number, the mode of Number, and the corresponding mode count. For Letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean, since this is a string).

So the output would ideally be:

Category     Color     CountNumber(Non-Nulls)   Size   MeanNumber  ModeNumber ModeCountNumber   CountLetter(Non-Nulls)  ModeLetter   ModeCountLetter
1            Red       2                        2      5.5         4 (or 7) 
1            Yellow    1                        2      7           7     
1            Green     1                        1      9           9       
1            Blue      -                        1      -           -       
2            Blue      1                        2      9           9      etc 
2            Green     -                        1      -           -       
3            Green     2                        2      8           8       
3            Yellow    -                        1      -           -       
3            Blue      1                        1      8           8       
3            Red       -                        1      -           -

This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.

Thanks.

Tzach Zohar · Accepted Answer · 2017-09-12T18:30:27.713

As far as I know, there's no simple way to compute the mode: you have to count the occurrences of each value and then join that result with its per-key maximum. The rest of the computations are rather straightforward:

import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

// count occurrences of each non-null Number in its category and color
// (nulls are dropped first so that a frequent null cannot win the max below)
val numberCounts = df.where($"Number".isNotNull)
  .groupBy("Category", "Color", "Number").count().cache()

// compute modes for Number by joining counts with the maximum count per category and color
val maxCounts = numberCounts.groupBy("Category", "Color").agg(max("count") as "_max")
val modeNumbers = numberCounts.as("base").join(maxCounts.as("max"),
  $"base.Category" === $"max.Category" and
  $"base.Color" === $"max.Color" and
  $"base.count" === $"max._max")
  .select($"base.Category", $"base.Color", $"base.Number", $"_max")
  .groupBy("Category", "Color")
  // on ties, first() keeps an arbitrary one of the tied values
  .agg(first("Number") as "ModeNumber", first("_max") as "ModeCountNumber")

// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
  count("Color") as "Size", // counting a key column -> includes nulls
  count("Number") as "CountNumber", // does not include nulls
  mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")

result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// |       3|Yellow|   1|          0|      null|      null|           null|
// |       1| Green|   1|          1|       9.0|         9|              1|
// |       1|   Red|   2|          2|       5.5|         7|              1|
// |       2| Green|   1|          0|      null|      null|           null|
// |       3|  Blue|   1|          1|       8.0|         8|              1|
// |       1|Yellow|   2|          1|       7.0|         7|              1|
// |       2|  Blue|   2|          1|       9.0|         9|              1|
// |       3| Green|   2|          2|       8.0|         8|              2|
// |       1|  Blue|   1|          0|      null|      null|           null|
// |       3|   Red|   1|          0|      null|      null|           null|
// +--------+------+----+-----------+----------+----------+---------------+

As you can imagine, this might be slow: it uses four groupBys and two joins, all of which require shuffles.
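One possible way to trim the shuffle-heavy steps is to swap the max aggregation, the self-join, and the follow-up groupBy for a single window ranked by count. The sketch below is an alternative technique, not the approach in the code above; the object name ModeWithWindow, the helper numberModes, and the small inline dataset (a few rows from the question) are assumptions made for illustration. Whether it is actually faster depends on the data and should be measured.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ModeWithWindow {
  // Hypothetical helper: mode of Number per (Category, Color) via a window
  // instead of a self-join. Assumes the input has no columns named
  // "count" or "rank".
  def numberModes(df: DataFrame): DataFrame = {
    // Count each non-null Number once per group.
    val counts = df.where(col("Number").isNotNull)
      .groupBy("Category", "Color", "Number").count()

    // Rank values inside each group by descending count;
    // ties go to the smaller Number so the result is deterministic.
    val byCount = Window.partitionBy("Category", "Color")
      .orderBy(col("count").desc, col("Number").asc)

    counts.withColumn("rank", row_number().over(byCount))
      .where(col("rank") === 1)
      .select(col("Category"), col("Color"),
        col("Number").as("ModeNumber"), col("count").as("ModeCountNumber"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("mode-window").getOrCreate()
    import spark.implicits._

    // A few rows from the question; None stands for Null.
    val df = Seq[(Int, String, Option[Int])](
      (1, "Red", Some(4)),
      (1, "Red", Some(7)),
      (3, "Green", Some(8)),
      (3, "Green", Some(8)),
      (2, "Blue", None),
      (2, "Blue", Some(9)),
      (1, "Blue", None)
    ).toDF("Category", "Color", "Number")

    // Size, count and mean as before, then a single left join to add the mode.
    val result = df.groupBy("Category", "Color").agg(
      count(lit(1)).as("Size"),
      count(col("Number")).as("CountNumber"),
      mean(col("Number")).as("MeanNumber")
    ).join(numberModes(df), Seq("Category", "Color"), "left")

    result.orderBy("Category", "Color").show()
    spark.stop()
  }
}
```

Because ties are broken by the smaller value, Category 1 / Red (where 4 and 7 each occur once) reports 4 here. Groups whose Number values are all null get null for both mode columns through the left join.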

As for the Letter column statistics, I'm afraid you'll have to repeat this for that column separately and add another join.
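To avoid copying the mode logic by hand for each column, the repetition could be wrapped in a small function and folded over the column names. This is only a sketch under assumptions: modeOf and ModePerColumn are hypothetical names, the mode is picked with a row_number window (a swapped-in technique, not the join shown earlier), the data columns must not be named "count" or "rank", and the inline rows are a subset of the question's data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ModePerColumn {
  // Hypothetical helper: mode and mode count of `column` per group of `keys`.
  // Output columns are named Mode<column> and ModeCount<column>.
  def modeOf(df: DataFrame, keys: Seq[String], column: String): DataFrame = {
    // Highest count first; ties go to the smaller value.
    val w = Window.partitionBy(keys.map(col): _*)
      .orderBy(col("count").desc, col(column).asc)

    df.where(col(column).isNotNull)              // nulls never count as a mode
      .groupBy((keys :+ column).map(col): _*)
      .count()
      .withColumn("rank", row_number().over(w))
      .where(col("rank") === 1)
      .select(keys.map(col) ++ Seq(
        col(column).as(s"Mode$column"),
        col("count").as(s"ModeCount$column")): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("mode-per-column").getOrCreate()
    import spark.implicits._

    // A few rows from the question; None stands for Null.
    val df = Seq[(Int, String, Option[Int], Option[String])](
      (1, "Red", Some(4), Some("A")),
      (1, "Red", Some(7), Some("C")),
      (3, "Green", Some(8), Some("C")),
      (3, "Green", Some(8), Some("B")),
      (1, "Yellow", None, Some("B")),
      (1, "Yellow", Some(7), None)
    ).toDF("Category", "Color", "Number", "Letter")

    val keys = Seq("Category", "Color")

    // Simple aggregates for both columns in one pass.
    val base = df.groupBy(keys.map(col): _*).agg(
      count(lit(1)).as("Size"),
      count(col("Number")).as("CountNumber"),
      mean(col("Number")).as("MeanNumber"),
      count(col("Letter")).as("CountLetter"))

    // One left join per column whose mode is wanted.
    val result = Seq("Number", "Letter").foldLeft(base) { (acc, c) =>
      acc.join(modeOf(df, keys, c), keys, "left")
    }

    result.orderBy("Category", "Color").show()
    spark.stop()
  }
}
```

Each extra column still costs one more aggregation and one more join, as the answer says; the helper only removes the duplicated code.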

Hmm, I'm having issues with the mode and mode count not producing correct output. Note that I'm starting from a different dataframe. - user48944, Sep 12 '17 at 23:22
I've added my current code. Basically, it is giving me null (and incorrect) values for the mode. - user48944, Sep 12 '17 at 23:29
