Spark aggregation on skewed data with groupBy takes too long

Asked Mar 16 '22 at 18:55

Active Mar 16 '22 at 18:55

Viewed 636 times

I am using Spark 2.4.5 and Java 1.8 in my project. I have to do a few aggregations based on some group-by columns, which results in a lot of skew.

In my original dataset I need to aggregate company data grouped by country.

For a similar use case, I have a simple sample dataset below.

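import org.apache.spark.sql.functions.lit
import spark.implicits._  // provides toDF; already in scope inside spark-shell

// Heavily skewed sample: 10,000,000 rows for USA vs. 100 rows for AUS
val df1 = sc.parallelize(1 to 10000000).map(x => x / 100000.0).toDF("score")
          .withColumn("country", lit("USA"))

val df2 = sc.parallelize(1 to 100).map(x => x / 1000.0).toDF("score")
          .withColumn("country", lit("AUS"))

val df = df1.unionByName(df2)
df.show()

val viewName = "tabl"
df.createOrReplaceTempView(viewName)

// Exact percentile, mean and row count per country
val query = s"select country, percentile(score, 0.0) as percentile_0, avg(score) as mean, count(1) as cnt from $viewName group by country"
val ben = spark.sql(query)
ben.show(2)

// Drop the view only after the result has been displayed
spark.catalog.dropTempView(viewName)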

asked Mar 16 '22 at 18:55

Shasu

@Raphael Roth Can you please help me solve the data skew described above? – Shasu Mar 16 '22 at 18:56
@Shaido Can you please tell me how to handle groupBy on skewed data? – Shasu Mar 16 '22 at 18:57
@Jacek Laskowski How do I handle groupBy on skewed data? – Shasu Mar 16 '22 at 18:59
https://spark.apache.org/docs/latest/api/sql/#approx_percentile – David דודו Markovitz Mar 16 '22 at 20:41
@David דודו Markovitz How does this solve the problem of skew? – Shasu Mar 17 '22 at 01:35
Yes. The execution time drops to the floor. – David דודו Markovitz Mar 17 '22 at 05:21
@DavidדודוMarkovitz So how can I improve it? How can I avoid the skew? – Shasu Mar 17 '22 at 09:53
Use approx_percentile instead of percentile – David דודו Markovitz Mar 17 '22 at 09:57
@DavidדודוMarkovitz That's not giving the correct value, hence I am forced to use "percentile" instead of "approx_percentile". – Shasu Mar 21 '22 at 09:31
(1) "approx" is a shortcut for "approximate" (2) you can tune the error rate – David דודו Markovitz Mar 21 '22 at 09:33
@DavidדודוMarkovitz Even if I tune the error rate, how can I avoid the skew? I am using groupBy, so how do I handle it? – Shasu Mar 21 '22 at 09:34
Have you tested it? – David דודו Markovitz Mar 21 '22 at 10:16
@DavidדודוMarkovitz Sorry for the delay. Yes, I tested it, and it's not giving exact results. – Shasu Mar 30 '22 at 16:04
(1) Performance wise you should have seen a massive improvement (2) As I said, "approx" is a shortcut for "approximate" and you can tune the error rate – David דודו Markovitz Mar 30 '22 at 16:06
@DavidדודוMarkovitz , any clue on this issue https://stackoverflow.com/questions/74035832/exception-occured-while-writing-delta-format-in-aws-s3 ? – Shasu Oct 12 '22 at 02:29
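
The suggestion in the comments above (use approx_percentile and tune its error rate) could look roughly like the sketch below. It is only an illustration under stated assumptions: the smaller row count, the local SparkSession, the extra median_approx column and the accuracy value 100000 are not from the question. The optional third argument of approx_percentile is the accuracy, and 1.0 / accuracy is the relative error (the default accuracy is 10000). Since the 0.0 percentile is by definition the smallest value in the group, the sketch also uses min(score) for that particular column, which stays exact.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("approx-percentile-sketch")
  .getOrCreate()
import spark.implicits._

// Assumption: a smaller version of the skewed sample data from the question.
val usa = spark.sparkContext.parallelize(1 to 1000000).map(_ / 100000.0).toDF("score")
  .withColumn("country", lit("USA"))
val aus = spark.sparkContext.parallelize(1 to 100).map(_ / 1000.0).toDF("score")
  .withColumn("country", lit("AUS"))
usa.unionByName(aus).createOrReplaceTempView("tabl")

// approx_percentile(col, percentage, accuracy): a higher accuracy gives a
// smaller relative error (1.0 / accuracy) at the cost of more memory.
// min(score) is an exact stand-in for percentile(score, 0.0).
val result = spark.sql(
  """select country,
    |       min(score) as percentile_0,
    |       approx_percentile(score, 0.5, 100000) as median_approx,
    |       avg(score) as mean,
    |       count(1) as cnt
    |from tabl
    |group by country""".stripMargin)

result.show()
```

Whether the approximation is acceptable depends on the use case. If exact values are required for percentiles other than 0.0 or 1.0, this sketch does not remove that requirement; it only shows how the accuracy parameter is passed.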


0 Answers