
I would like to create a Map column which counts the number of occurrences of each value.

For instance:

+---+----+
|  b|   a|
+---+----+
|  1|   b|
|  2|null|
|  1|   a|
|  1|   a|
+---+----+

would result in

+---+--------------------+
|  b|                 res|
+---+--------------------+
|  1|[a -> 2.0, b -> 1.0]|
|  2|                  []|
+---+--------------------+

For the moment, in Spark 2.4.6, I was able to do this using a UDAF.

While bumping to Spark 3, I was wondering if I could get rid of this UDAF (I tried the new `aggregate` function without success).

Is there an efficient way to do it? (For the efficiency part, I am able to test easily.)
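
For reference, the sample input above can be created like this (a minimal sketch, assuming a local SparkSession named `spark`):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// sample data from the tables above; column a is nullable
val df = Seq((1, "b"), (2, null), (1, "a"), (1, "a")).toDF("b", "a")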

BlueSheepToken

4 Answers


Here is a Spark 3 solution:

import org.apache.spark.sql.functions._
import spark.implicits._   // for the $"..." column syntax (pre-imported in spark-shell)

df.groupBy($"b",$"a").count()
  .groupBy($"b")
  .agg(
    map_from_entries(
      collect_list(
        when($"a".isNotNull,struct($"a",$"count"))
      )
    ).as("res")
  )
  .show()

gives:

+---+----------------+
|  b|             res|
+---+----------------+
|  1|[b -> 1, a -> 2]|
|  2|              []|
+---+----------------+
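
If the Double values shown in the question's expected output are needed, the count can be cast inside the struct (a small variation of the snippet above, same logic otherwise):

df.groupBy($"b",$"a").count()
  .groupBy($"b")
  .agg(
    map_from_entries(
      collect_list(
        // cast so the map values become Double, matching the 2.0 / 1.0 in the question
        when($"a".isNotNull, struct($"a", $"count".cast("double")))
      )
    ).as("res")
  )
  .show()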

Here is the solution using an Aggregator:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Encoder

val countOcc = new Aggregator[String, Map[String,Int], Map[String,Int]] with Serializable {
    // empty buffer with default count 0 (see the comments below about getOrElse)
    def zero: Map[String,Int] = Map.empty.withDefaultValue(0)
    // increment the count for the incoming value, ignoring nulls
    def reduce(b: Map[String,Int], a: String) = if(a != null) b + (a -> (b(a) + 1)) else b
    // sum the counts of two partial buffers
    def merge(b1: Map[String,Int], b2: Map[String,Int]) = {
      val keys = b1.keys.toSet.union(b2.keys.toSet)
      keys.map{ k => (k -> (b1(k) + b2(k))) }.toMap
    }
    def finish(b: Map[String,Int]) = b
    def bufferEncoder: Encoder[Map[String,Int]] = implicitly(ExpressionEncoder[Map[String,Int]])
    def outputEncoder: Encoder[Map[String, Int]] = implicitly(ExpressionEncoder[Map[String, Int]])
}

val countOccUDAF = udaf(countOcc)   // wrap the Aggregator for use on untyped DataFrame columns

df
  .groupBy($"b")
  .agg(countOccUDAF($"a").as("res"))
  .show()

gives:

+---+----------------+
|  b|             res|
+---+----------------+
|  1|[b -> 1, a -> 2]|
|  2|              []|
+---+----------------+
Raphael Roth
  • @BlueSheepToken Added the solution with Aggregator – Raphael Roth Oct 14 '20 at 09:05
  • thank you very much. I am going to accept this one. Do you know the impact on performances of using `udaf` instead of `TypedColumn`? (I will probably try it soon and come back with the results here) – BlueSheepToken Oct 14 '20 at 09:22
  • Also, may I suggest to replace this: `b1(k) + b2(k)` by this: `b1.getOrElse(k, 0) + b2.getOrElse(k, 0)` ? (Except that I was struggling to apply this on a dataframe (the `val countOccUDAF = udaf(countOcc)` saved me), I came up with the exact same solution !) – BlueSheepToken Oct 14 '20 at 09:24
  • @BlueSheepToken `b1.getOrElse(k, 0)` is not necessary because I used `Map.empty.withDefaultValue(0)` – Raphael Roth Oct 16 '20 at 11:02
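
Regarding the TypedColumn question in the comments above: the same Aggregator can also be used on a typed Dataset via toColumn. A minimal sketch, assuming import spark.implicits._ is in scope and using a hypothetical case class Rec for the example schema (not benchmarked against udaf):

// hypothetical case class matching the example columns b and a
case class Rec(b: Int, a: String)

df.as[Rec]
  .groupByKey(_.b)
  .mapValues(_.a)                        // feed only column a into the Aggregator
  .agg(countOcc.toColumn.name("res"))
  .toDF("b", "res")
  .show()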

You could always use collect_list with a UDF, but only if your groupings are not too large:

// build a histogram of the values collected per group
val udf_histo = udf((x: Seq[String]) => x.groupBy(identity).mapValues(_.size))

df.groupBy($"b")
  .agg(
    collect_list($"a").as("as")
  )
  .select($"b",udf_histo($"as").as("res"))
  .show()

gives:

+---+----------------+
|  b|             res|
+---+----------------+
|  1|[b -> 1, a -> 2]|
|  2|              []|
+---+----------------+

This should be faster than a UDAF: Spark custom aggregation : collect_list+UDF vs UDAF
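
One note, as an assumption about Scala versions: the snippet above works as written on Scala 2.12 (which Spark 3.0 is built against), but on Scala 2.13 mapValues returns a lazy view, so appending .toMap keeps the UDF returning a plain Map:

// Scala 2.13 variant: materialize the mapValues view
val udf_histo = udf((x: Seq[String]) => x.groupBy(identity).mapValues(_.size).toMap)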

Raphael Roth

We can achieve this in Spark 2.4:

import org.apache.spark.sql.functions._
import spark.implicits._   // for the $ column syntax

// get the counts
val groupedCountDf = originalDf.groupBy("b", "a").count

// create a map for every count (an empty map for the null key)
// and aggregate the maps into an array
val dfWithArrayOfMaps = groupedCountDf
  .withColumn("newMap", when($"a".isNotNull, map($"a", $"count")).otherwise(map()))
  .groupBy("b").agg(collect_list($"newMap") as "multimap")

// expression to convert Array[Map] -> Map
val mapConcatExpr = expr("aggregate(multimap, map(), (k, v) -> map_concat(k, v))")

val finalDf = dfWithArrayOfMaps.select($"b", mapConcatExpr.as("merged_data"))
Sanket9394

Here is a solution with a single groupBy and a slightly more complex SQL expression. It works for Spark 2.4+.

df.groupBy("b")
  .agg(expr("sort_array(collect_set(a)) as set"),
       expr("sort_array(collect_list(a)) as list"))
  .withColumn("res",
       expr("map_from_arrays(set,transform(set, x -> size(filter(list, y -> y=x))))"))
  .show()

Output:

+---+------+---------+----------------+
|  b|   set|     list|             res|
+---+------+---------+----------------+
|  1|[a, b]|[a, a, b]|[a -> 2, b -> 1]|
|  2|    []|       []|              []|
+---+------+---------+----------------+

The idea is to collect the data from column a twice: once into a set and once into a list. Then, with the help of transform, the number of occurrences of each element of the set within the list is counted. Finally, the set and the counts are combined with map_from_arrays.

However, I cannot say whether this approach is really faster than a UDAF.
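
If the helper set and list columns are not wanted in the result, the same expression can go into a select instead of withColumn (an equivalent variation, untested here):

df.groupBy("b")
  .agg(expr("sort_array(collect_set(a)) as set"),
       expr("sort_array(collect_list(a)) as list"))
  .select($"b",
       expr("map_from_arrays(set, transform(set, x -> size(filter(list, y -> y = x))))").as("res"))
  .show()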

werner
  • this does not scale very well if groupings are large. All data for one group is pulled into one executor – Raphael Roth Oct 13 '20 at 19:32
  • @RaphaelRoth that's true. I would guess that it depends on the actual data whether it's faster to collect everything to one executor or better to do the second group by – werner Oct 13 '20 at 19:37
  • Also, I might be wrong, but is there any performance impact from `sort_array` and `filter`? – BlueSheepToken Oct 14 '20 at 07:42
  • @BlueSheepToken yes, depending on your data the `sort_array` and `filter` might or might not be faster than a UDAF – werner Oct 14 '20 at 18:24