
I am using Spark 2.4.5 and Java 1.8 in my project. I have to do a few aggregations based on some group-by columns, which results in a lot of skew.

In my original dataset, I need to aggregate company data grouped by county.

For a similar use case, I have a simple sample dataset below.

import org.apache.spark.sql.functions.lit
import spark.implicits._

// Skewed sample: 10,000,000 rows for USA versus 100 rows for AUS
val df1 = sc.parallelize(1 to 10000000).map(x => x / 100000.0).toDF("score")
          .withColumn("country", lit("USA"))

val df2 = sc.parallelize(1 to 100).map(x => x / 1000.0).toDF("score")
          .withColumn("country", lit("AUS"))

val df = df1.unionByName(df2)
df.show()

val viewName = "tabl"
df.createOrReplaceTempView(viewName)

// Aggregate per country; nearly all rows share the key "USA"
val query = s"select country, percentile(score, 0.0) as percentile_0, avg(score) as mean, count(1) as cnt from $viewName group by country"
val ben = spark.sql(query)
spark.catalog.dropTempView(viewName)
ben.show(2)


Shasu
