I am new to Spark and trying to understand how to deal with skewed data. I have created two tables, employee and department. The employee table is skewed: one department accounts for a disproportionate share of its rows.
One solution is to broadcast the department table, and that works fine. But I want to understand how I could use the salting technique in the code below to improve performance.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = (SparkSession.builder.appName("skewTestSpark")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport().getOrCreate())
df1 = spark.sql("select * from spark.employee")
df2 = spark.sql("select id as dept_id, name as dept_name from spark.department")
res = df1.join(df2, df1.department==df2.dept_id)
res.write.parquet("hdfs://<host>:<port>/user/result/employee")