
For the dataset below, to get the total summary values per Col1, I did:

import org.apache.spark.sql.functions._
val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))

and then merged it with the original:

df.union(totaldf).orderBy(col("Col1"), col("Col2").desc).show(false)

The original `df`:

+-----------+-------+--------+--------------+
|   Col1    | Col2  | price  | displayPrice |
+-----------+-------+--------+--------------+
| Category1 | item1 |     15 |           14 |
| Category1 | item2 |     11 |           10 |
| Category1 | item3 |     18 |           16 |
| Category2 | item1 |     16 |           15 |
| Category2 | item2 |     11 |           10 |
| Category2 | item3 |     19 |           17 |
+-----------+-------+--------+--------------+

After merging:

+-----------+-------+-------+--------------+
|   Col1    | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Category1 | Total |    44 |           40 |
| Category1 | item1 |    15 |           14 |
| Category1 | item2 |    11 |           10 |
| Category1 | item3 |    18 |           16 |
| Category2 | Total |    46 |           42 |
| Category2 | item1 |    16 |           15 |
| Category2 | item2 |    11 |           10 |
| Category2 | item3 |    19 |           17 |
+-----------+-------+-------+--------------+
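The group-then-union logic above can be sketched with plain Scala collections, so it runs without a Spark session. The `Row` case class and sample values here are hypothetical stand-ins for the real `df`:

```scala
// Plain-Scala analogue of groupBy("Col1").agg(...) followed by union.
// Row and the sample data are stand-ins for the Spark DataFrame above.
case class Row(col1: String, col2: String, price: Int, displayPrice: Int)

val df = List(
  Row("Category1", "item1", 15, 14), Row("Category1", "item2", 11, 10),
  Row("Category1", "item3", 18, 16), Row("Category2", "item1", 16, 15),
  Row("Category2", "item2", 11, 10), Row("Category2", "item3", 19, 17)
)

// Equivalent of agg(lit("Total").as("Col2"), sum("price"), sum("displayPrice"))
val totals = df.groupBy(_.col1).map { case (cat, rows) =>
  Row(cat, "Total", rows.map(_.price).sum, rows.map(_.displayPrice).sum)
}.toList

// Equivalent of df.union(totaldf): eight rows, two of them category totals
val merged = df ++ totals
```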

Now I want a summary of the whole dataset as below, which has the Col1 summary as "Total" and contains the data of all Col1 and Col2 values. Required output:

+-----------+-------+-------+--------------+
|   Col1    | Col2  | price | displayPrice |
+-----------+-------+-------+--------------+
| Total     | Total |    90 |           82 |
| Category1 | Total |    44 |           40 |
| Category1 | item1 |    15 |           14 |
| Category1 | item2 |    11 |           10 |
| Category1 | item3 |    18 |           16 |
| Category2 | Total |    46 |           42 |
| Category2 | item1 |    16 |           15 |
| Category2 | item2 |    11 |           10 |
| Category2 | item3 |    19 |           17 |
+-----------+-------+-------+--------------+

How can I achieve the above result?

1 Answer


Create a third dataframe from totaldf as:

val finalTotalDF= totaldf.select(lit("Total").as("Col1"), lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))

and then use it in the union as:

df.union(totaldf).union(finalTotalDF).orderBy(col("Col1"), col("Col2").desc).show(false)

You should then have your final required dataframe.

Updated

If ordering matters to you, change the capital T of "Total" in the Col2 column to lowercase ("total"), so that the total rows come first under the descending sort:

import org.apache.spark.sql.functions._
val totaldf = df.groupBy("Col1").agg(lit("total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
val finalTotalDF= totaldf.select(lit("Total").as("Col1"), lit("total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"))
df.union(totaldf).union(finalTotalDF).orderBy(col("Col1").desc, col("Col2").desc).show(false)

and you should get

+---------+-----+-----+------------+
|Col1     |Col2 |price|displayPrice|
+---------+-----+-----+------------+
|Total    |total|90   |82          |
|Category2|total|46   |42          |
|Category2|item3|19   |17          |
|Category2|item2|11   |10          |
|Category2|item1|16   |15          |
|Category1|total|44   |40          |
|Category1|item3|18   |16          |
|Category1|item2|11   |10          |
|Category1|item1|15   |14          |
+---------+-----+-----+------------+
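The lowercase trick works because the default string ordering compares character code points, and lowercase letters ('t' = 116) are greater than uppercase ones ('T' = 84, 'i' = 105): "total" sorts after every item name and so lands on top of a descending sort, while a capitalised "Total" would fall behind the items. A tiny plain-Scala check:

```scala
// Under a descending string sort, lowercase "total" comes first,
// while capitalised "Total" ends up last, behind the item names.
val col2Desc = List("item1", "item2", "Total", "total").sorted(Ordering[String].reverse)
// col2Desc == List("total", "item2", "item1", "Total")
```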

If ordering really matters to you, as mentioned in the comment:

"I want the total data as priority, so I want that to be at the top, which is actually the requirement for me"

then you can create an extra column to sort on:

import org.apache.spark.sql.functions._
val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"), lit(1).as("sort"))
val finalTotalDF= totaldf.select(lit("Total").as("Col1"), lit("Total").as("Col2"), sum("price").as("price"), sum("displayPrice").as("displayPrice"), lit(0).as("sort"))
finalTotalDF.union(totaldf).union(df.withColumn("sort", lit(2))).orderBy(col("sort"), col("Col1"), col("Col2")).drop("sort").show(false)

and you should get

+---------+-----+-----+------------+
|Col1     |Col2 |price|displayPrice|
+---------+-----+-----+------------+
|Total    |Total|90   |82          |
|Category1|Total|44   |40          |
|Category2|Total|46   |42          |
|Category1|item1|15   |14          |
|Category1|item2|11   |10          |
|Category1|item3|18   |16          |
|Category2|item1|16   |15          |
|Category2|item2|11   |10          |
|Category2|item3|19   |17          |
+---------+-----+-----+------------+
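The three-level sort key above can be sketched in plain Scala to see why it yields this ordering; the `Row` case class and sample rows are hypothetical stand-ins for the Spark DataFrames:

```scala
// Sort key: 0 = grand total, 1 = per-category totals, 2 = item rows.
// Sorting by (sort, col1, col2) ascending then dropping the key mimics
// orderBy(col("sort"), col("Col1"), col("Col2")).drop("sort").
case class Row(col1: String, col2: String, price: Int, displayPrice: Int, sort: Int)

val items = List(
  Row("Category1", "item1", 15, 14, 2), Row("Category1", "item2", 11, 10, 2),
  Row("Category1", "item3", 18, 16, 2), Row("Category2", "item1", 16, 15, 2),
  Row("Category2", "item2", 11, 10, 2), Row("Category2", "item3", 19, 17, 2)
)

val catTotals = items.groupBy(_.col1).map { case (c, rs) =>
  Row(c, "Total", rs.map(_.price).sum, rs.map(_.displayPrice).sum, 1)
}.toList

val grand = Row("Total", "Total",
  items.map(_.price).sum, items.map(_.displayPrice).sum, 0)

// Grand total first, then category totals, then items in ascending order
val ordered = (grand :: catTotals ::: items).sortBy(r => (r.sort, r.col1, r.col2))
```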
Ramesh Maharjan
  • Okay, But Total should always be on Top – Aymar harsh Mar 05 '18 at 09:33
  • Ramesh, How will I get Total in Col1, we are not doing anything as total for Column, Can you consider the Required Dataset once , The answer suggested is same as what we do for Col2 right ? `val totaldf = df.groupBy("Col1").agg(lit("Total").as("Col2"),sum("price").as("price"), sum("displayPrice").as("displayPrice"))` – Aymar harsh Mar 05 '18 at 09:43
  • Okay let me try this.Thanks though:) – Aymar harsh Mar 05 '18 at 09:46
  • last update should definitely work for you :) finally I tested as well @Aymarharsh – Ramesh Maharjan Mar 05 '18 at 10:08
  • It works, Ramesh, but to fit my requirement I want the Totals only at the top and the rest sorted ascending @Ramesh; it would be great if you could help me solve this. – Aymar harsh Mar 05 '18 at 13:07
  • @Aymarharsh, great to hear that the answer is helpful. Acceptance and upvote would help me too. :) – Ramesh Maharjan Mar 05 '18 at 13:59
  • Yeah sure Ramesh. Just to mention again as mentioned above. I want the Totals to be on the top and rest to be ascending ordered. Can we do that in spark. – Aymar harsh Mar 06 '18 at 03:00
  • Yeah, since I'll be dealing with huge data, I want the total data as priority, so I want that to be at the top, which is actually the requirement for me :( – Aymar harsh Mar 06 '18 at 06:46
  • you want all the tops at the top? – Ramesh Maharjan Mar 06 '18 at 07:39
  • Yeah Ramesh , all the total at the top, rest in ascending order – Aymar harsh Mar 06 '18 at 09:26
  • @Aymarharsh, I have updated my answer please have a look :) I hope the answer would get upvote and acceptance for sure now :) – Ramesh Maharjan Mar 06 '18 at 11:06
  • Excellent work Ramesh. That worked like a charm. I might have some performance issue though but it met my requirement. I already upvoted – Aymar harsh Mar 06 '18 at 12:24