
I am new to Spark and trying to understand how to deal with skewed data. I have created two tables, employee and department. The employee table has skewed data for one of the departments.

One of the solutions is to broadcast the department table, and that works fine (a rough sketch of that variant is shown further below). But I want to understand how I could use the salting technique in the code below to improve performance.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = (SparkSession.builder
         .appName("skewTestSpark")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

df1 = spark.sql("select * from spark.employee")
df2 = spark.sql("select id as dept_id, name as dept_name from spark.department")

# join on the skewed department key
res = df1.join(df2, df1.department == df2.dept_id)
res.write.parquet("hdfs://<host>:<port>/user/result/employee")

Distribution for the above code: [screenshot]
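For reference, the broadcast variant I mentioned above looks roughly like this (just a sketch reusing df1 and df2 from the code above, with the broadcast hint from pyspark.sql.functions):

# hint Spark to ship the small department table to every executor,
# so the skewed department key never has to be shuffled
res = df1.join(f.broadcast(df2), df1.department == df2.dept_id)
res.write.parquet("hdfs://<host>:<port>/user/result/employee")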

  • Hi, did you get anywhere with salting? I'm in the same situation now, but the information on salting in different mediums is not very clear. – RFAI Aug 01 '22 at 23:24

1 Answer


It's unlikely that employees - even with skew - will cause a Spark bottleneck. In fact, the example is flawed. Think of large JOINs, not something that fits into the broadcast-join category.

Salting: with "salting" on a SQL join, grouping, or similar operation, the key is changed so that the data is redistributed evenly and the processing time for any given partition is similar.

An excellent example for a JOIN is here: https://dzone.com/articles/why-your-spark-apps-are-slow-or-failing-part-ii-da

Another good read I would recommend is here: https://godatadriven.com/blog/b-efficient-large-spark-optimisation/

I could explain it all, but the first link explains it well enough. Some experimentation is needed to get a better key distribution.
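To make the idea concrete, here is a rough salted-join sketch for the tables in the question (untested; SALT_BUCKETS is a made-up tuning knob you would pick by experimenting with your own key distribution, and df1/df2 are the DataFrames from the question):

from pyspark.sql import functions as f

SALT_BUCKETS = 16  # hypothetical salt factor, tune by experiment

# skewed (large) side: tag every employee row with a random salt 0..SALT_BUCKETS-1
emp_salted = df1.withColumn("emp_salt", (f.rand() * SALT_BUCKETS).cast("int"))

# other side: replicate every department row once per possible salt value
dept_salted = df2.withColumn(
    "dept_salt", f.explode(f.array([f.lit(i) for i in range(SALT_BUCKETS)]))
)

# join on the original key plus the salt, then drop the helper columns
res = (emp_salted.join(
           dept_salted,
           (emp_salted.department == dept_salted.dept_id)
           & (emp_salted.emp_salt == dept_salted.dept_salt))
       .drop("emp_salt", "dept_salt"))

res.write.parquet("hdfs://<host>:<port>/user/result/employee")

The salt spreads the hot department across SALT_BUCKETS partitions instead of one, at the cost of duplicating the department rows SALT_BUCKETS times before the shuffle.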

  • Yes, both tables would need to be huge to cause a performance issue. But if we consider both tables having a huge amount of data, how can we write the salting technique in PySpark for the same scenario? – Ankur Sep 07 '20 at 09:17
  • The articles give an example of salting the key. You need to make the distribution more even. That is a question of experimentation. – thebluephantom Sep 07 '20 at 09:30