
I am trying to improve my Spark/Scala skills and I have a case I cannot find a way to handle, so please advise!

I have the original data as shown in the figure below:

[figure: original data showing the Code and count columns]

I want to calculate the percentage that each value in the count column represents. E.g. the last error value is 64; what percentage is 64 of the sum of the whole column? Please note that I am reading the original data as DataFrames using sqlContext. Here is my code:

    val df1 = df.groupBy(" Code")
      .agg(sum("count").alias("sum"),
           mean("count").multiply(100).cast("integer").alias("percentage"))

I want results similar to this:

[figure: desired output with a percentage column]

Thanks in advance!

  • Could you please add a sample of the code you are playing with so far? – Pavel Oct 21 '17 at 12:51
  • I am using .....groupBy(" Code").agg(count(" Code")) on a large data file and I am collecting results like the original data table. Now I want to compute the percentage of every count value shown in the 'count' column – Foaad Mohamad Haddod Oct 21 '17 at 12:55
  • Please share the code samples you are working with; even if it's totally wrong, this will help people understand the problem(s) and give you suggestions, etc. – Pavel Oct 21 '17 at 12:57
  • I have tried this code but it doesn't give me percentage results: val df1 = df.groupBy(" Code").agg(sum("count").alias("sum"), mean("count").multiply(100).cast("integer").alias("percentage")) – Foaad Mohamad Haddod Oct 21 '17 at 13:01

1 Answer


Use agg and window functions:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

df
  .groupBy("code")
  .agg(sum("count").alias("count"))
  // an empty over() window spans the entire result,
  // so each group's count is divided by the grand total
  .withColumn("fraction", col("count") / sum("count").over())
  • First and foremost, thanks a lot for your reply. With a few changes to your code I managed to get what I need. In my case, since I already had the columns 'Code' and 'count', I had to group by both to avoid re-counting the Code values; otherwise the counts would be recomputed and the percentages would always come out equal. Here is what I did: .groupBy("Code", "count").agg(sum("count") as "count1").withColumn("fraction", col("count") / sum("count1").over()) and it worked perfectly (: – Foaad Mohamad Haddod Oct 21 '17 at 14:36
  • You don't need groupBy("Code", "count"), just groupBy("Code"). – thebluephantom Jan 15 '18 at 23:05
  • @FoaadMohamadHaddod Hi, I'm getting `over() takes exactly 2 arguments (1 given)`, why? – ZK Zhao Aug 22 '18 at 10:01
  • How would you do this with regular Spark SQL? – Dalupus Aug 13 '22 at 18:00
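On the Spark SQL question above: the same result can be expressed with a windowed sum over the grouped result. A hedged sketch, reusing the spark and df names from the earlier sketch (the temp view name t is only for illustration):

    // register the DataFrame so it can be queried with SQL
    df.createOrReplaceTempView("t")

    spark.sql("""
      SELECT Code,
             SUM(`count`)                             AS `count`,
             SUM(`count`) / SUM(SUM(`count`)) OVER () AS fraction
      FROM t
      GROUP BY Code
    """).show()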