// Spark 1.6.x: toDF requires the SQLContext implicits
// import sqlContext.implicits._
val df = Seq(
  (1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
  (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")
).toDF("col1", "col2", "col3")

So I have a Spark DataFrame with three columns:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|  10|
|   1|   b|  12|
|   1|   c|  13|
|   2|   a|  14|
|   2|   c|  11|
|   1|   b|  12|
|   2|   c|  12|
|   3|   r|  11|
+----+----+----+

My requirement is to perform two levels of groupBy, as explained below.

Level 1: if I group by col1 and sum col3, I get two columns: (1) col1 and (2) sum(col3). I lose col2 here.

Level 2: if I group by col1 and col2 and sum col3, I get three columns: (1) col1, (2) col2, and (3) sum(col3).

What I actually need is to perform both levels of groupBy and have both sums (sum(col3) from level 1 and sum(col3) from level 2) together in one final DataFrame.

How can I do this? Can anyone explain?

Spark: 1.6.2, Scala: 2.10


1 Answer


One option is to do the two sums separately and then join them back:

import org.apache.spark.sql.functions.sum
df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1"))
  .show

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
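
Note that the sums come back as doubles: col3 is a String column, so sum implicitly casts it to double. If you would rather keep integer sums, a minimal tweak (assuming col3 always holds integer strings) is to cast before aggregating:

df.groupBy("col1", "col2")
  .agg(sum($"col3".cast("long")).as("sum_level2"))  // cast String -> long before summing
  .show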

Another option is to use window functions, exploiting the fact that sum_level1 is just the sum of sum_level2 within each col1 partition:

import org.apache.spark.sql.expressions.Window

// Rows sharing the same col1 value form one window partition
val w = Window.partitionBy($"col1")

df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
  .show

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
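
As a quick sanity check, the two approaches return the same rows, just in a different order. A small sketch, assuming df and w from the snippets above are in scope:

val byJoin = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1"))
val byWindow = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
// The symmetric difference is empty when both results agree
assert(byJoin.except(byWindow).count == 0 && byWindow.except(byJoin).count == 0)
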
Psidom
  • Seq("col1") is this a key while joining? – Ramesh Jan 20 '17 at 20:05
  • Yes, `Seq("col1")` specifies the joining key. – Psidom Jan 20 '17 at 20:07
  • When I execute this, I see two exchanges, one for the groupBy, and one for the Window. Is there a way to avoid the second exchange? It seems like once you've done the groupBy("col1","col2"), it should not be necessary to move data around for the second step -- except if some col1 groups go across nodes. – Josiah Yoder Jul 18 '17 at 16:37
  • See [new question](https://stackoverflow.com/questions/45173093/how-to-avoid-exchanges-and-udfs-while-processing-sessions-in-spark) for my question above. – Josiah Yoder Jul 18 '17 at 17:07
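
To see the exchanges Josiah mentions, you can print the physical plan with Spark's standard explain API; each Exchange node in the output marks a shuffle:

// Sketch: inspect the physical plan of the window-based variant
df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
  .explain()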