I am using Spark 2.4.5 with Java 1.8 in my project. I have to do a few aggregations based on some group-by columns, which results in a lot of skew.
In my original dataset I need to aggregate company data grouped by country.
For a similar use case, I have a simple sample dataset below.
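import org.apache.spark.sql.functions.lit
import spark.implicits._ // needed for toDF when not running in spark-shell

// Large group: 10,000,000 rows for USA
val df1 = sc.parallelize(1 to 10000000).map(x => x / 100000.0).toDF("score")
  .withColumn("country", lit("USA"))
// Small group: 100 rows for AUS
val df2 = sc.parallelize(1 to 100).map(x => x / 1000.0).toDF("score")
  .withColumn("country", lit("AUS"))
val df = df1.unionByName(df2)
df.show()

// Register a temporary view and aggregate per country
val viewName = "tabl"
df.createOrReplaceTempView(viewName)
val query = s"select country, percentile(score, 0.0) as percentile_0, avg(score) as mean, count(1) as cnt from $viewName group by country"
val ben = spark.sql(query)
ben.show(2)
// Drop the view once the result has been displayed
spark.catalog.dropTempView(viewName)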
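
To make the skew in the sample data visible, one possible check is to count rows per group and then per partition after hash-partitioning on the grouping key. This is only a rough sketch: the local master, the app name `skew-check`, and the use of `spark.range` instead of `sc.parallelize` are my own assumptions for illustration, and `repartition(col("country"))` only approximates the shuffle that the group-by performs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, spark_partition_id}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("skew-check").getOrCreate()

// Same shape as the sample data: 10,000,000 USA rows and 100 AUS rows.
val usa = spark.range(1, 10000001)
  .select((col("id") / 100000.0).as("score"))
  .withColumn("country", lit("USA"))
val aus = spark.range(1, 101)
  .select((col("id") / 1000.0).as("score"))
  .withColumn("country", lit("AUS"))
val df = usa.unionByName(aus)

// Rows per group: the two groups differ by several orders of magnitude.
df.groupBy("country").count().show()

// Hash-partition on the grouping key, tag each row with its partition id,
// and count rows per partition, largest first.
val perPartition = df.repartition(col("country"))
  .withColumn("partition", spark_partition_id())
  .groupBy("partition")
  .count()
  .orderBy(col("count").desc)
perPartition.show(5)
```

With only two distinct keys, all rows end up in at most two partitions, and the partition holding USA carries nearly all of the data, which is the imbalance I am seeing on the real dataset.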