// Spark 1.6.x: toDF requires the SQLContext implicits
// import sqlContext.implicits._
val df = Seq(
  (1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
  (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")
).toDF("col1", "col2", "col3")

So I have a Spark DataFrame with three columns:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|  10|
|   1|   b|  12|
|   1|   c|  13|
|   2|   a|  14|
|   2|   c|  11|
|   1|   b|  12|
|   2|   c|  12|
|   3|   r|  11|
+----+----+----+

My requirement is to perform two levels of groupBy, as explained below.

Level 1: if I group by col1 and sum col3, I get two columns: (1) col1 and (2) sum(col3). I lose col2 here.

Level 2: if I group by col1 and col2 and sum col3, I get three columns: (1) col1, (2) col2, and (3) sum(col3).

What I actually need is to perform both levels of groupBy and have both sums (sum(col3) from level 1 and sum(col3) from level 2) together in one final DataFrame.

How can I do this? Can anyone explain?

Spark: 1.6.2, Scala: 2.10


1 Answer


One option is to do the two sums separately and then join them back:

import org.apache.spark.sql.functions.sum
df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1"))
  .show

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
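
Note that the sums come back as doubles: col3 is a String column, so sum implicitly casts it to double. If you would rather keep integer sums, a minimal tweak (assuming col3 always holds integer strings) is to cast before aggregating:

df.groupBy("col1", "col2")
  .agg(sum($"col3".cast("long")).as("sum_level2"))  // cast String -> long before summing
  .show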

Another option is to use window functions, exploiting the fact that sum_level1 is just the sum of sum_level2 within each col1 partition:

import org.apache.spark.sql.expressions.Window

// Rows sharing the same col1 value form one window partition
val w = Window.partitionBy($"col1")

df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
  .show

+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+
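
As a quick sanity check, the two approaches return the same rows, just in a different order. A small sketch, assuming df and w from the snippets above are in scope:

val byJoin = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1"))
val byWindow = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
// The symmetric difference is empty when both results agree
assert(byJoin.except(byWindow).count == 0 && byWindow.except(byJoin).count == 0)
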
Psidom
  • Seq("col1") is this a key while joining? – Ramesh Jan 20 '17 at 20:05
  • Yes, `Seq("col1")` specifies the joining key. – Psidom Jan 20 '17 at 20:07
  • When I execute this, I see two exchanges, one for the groupBy, and one for the Window. Is there a way to avoid the second exchange? It seems like once you've done the groupBy("col1","col2"), it should not be necessary to move data around for the second step -- except if some col1 groups go across nodes. – Josiah Yoder Jul 18 '17 at 16:37
  • See [new question](https://stackoverflow.com/questions/45173093/how-to-avoid-exchanges-and-udfs-while-processing-sessions-in-spark) for my question above. – Josiah Yoder Jul 18 '17 at 17:07
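
To see the exchanges Josiah mentions, you can print the physical plan with Spark's standard explain API; each Exchange node in the output marks a shuffle:

// Sketch: inspect the physical plan of the window-based variant
df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
  .explain()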