
I am trying to improve my Spark/Scala skills and I have a case I cannot find a way to handle, so please advise!

I have the original data as shown in the figure below:

[figure: original data showing the Code and count columns]

I want to calculate the percentage that each value in the count column represents. E.g. the last error value is 64; what percentage is 64 of the sum of the whole column? Please note that I am reading the original data as DataFrames using sqlContext. Here is my code:

    val df1 = df.groupBy(" Code")
      .agg(sum("count").alias("sum"),
           mean("count").multiply(100).cast("integer").alias("percentage"))

I want results similar to this:

[figure: desired output with a percentage column]

Thanks in advance!

  • Could you please add a sample of the code you are playing with so far? – Pavel Oct 21 '17 at 12:51
  • I am using .....groupBy(" Code").agg(count(" Code")) on a large data file and I am collecting results like the original data table. Now I want to compute the percentage of every count value shown in the 'count' column – Foaad Mohamad Haddod Oct 21 '17 at 12:55
  • Please share the code samples you are working with; even if it's totally wrong, this will help people understand the problem(s) and give you suggestions, etc. – Pavel Oct 21 '17 at 12:57
  • I have tried this code but it doesn't give me percentage results: val df1 = df.groupBy(" Code").agg(sum("count").alias("sum"), mean("count").multiply(100).cast("integer").alias("percentage")) – Foaad Mohamad Haddod Oct 21 '17 at 13:01

1 Answer


Use agg and window functions:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

df
  .groupBy("code")
  .agg(sum("count").alias("count"))
  // an empty over() window spans the entire result,
  // so each group's count is divided by the grand total
  .withColumn("fraction", col("count") / sum("count").over())
  • First and foremost, thanks a lot for your reply. With a few changes to your code I managed to get what I need. In my case, since I already had the columns 'Code' and 'count', I had to group by both to avoid re-counting the Code values; otherwise the counts would be recomputed and the percentages would always come out equal. Here is what I did: .groupBy("Code", "count").agg(sum("count") as "count1").withColumn("fraction", col("count") / sum("count1").over()) and it worked perfectly (: – Foaad Mohamad Haddod Oct 21 '17 at 14:36
  • You don't need groupBy("Code", "count"), just groupBy("Code"). – thebluephantom Jan 15 '18 at 23:05
  • @FoaadMohamadHaddod Hi, I'm getting `over() takes exactly 2 arguments (1 given)`, why? – ZK Zhao Aug 22 '18 at 10:01
  • How would you do this with regular Spark SQL? – Dalupus Aug 13 '22 at 18:00
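On the Spark SQL question above: the same result can be expressed with a windowed sum over the grouped result. A hedged sketch, reusing the spark and df names from the earlier sketch (the temp view name t is only for illustration):

    // register the DataFrame so it can be queried with SQL
    df.createOrReplaceTempView("t")

    spark.sql("""
      SELECT Code,
             SUM(`count`)                             AS `count`,
             SUM(`count`) / SUM(SUM(`count`)) OVER () AS fraction
      FROM t
      GROUP BY Code
    """).show()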