
I work with Spark SQL v2.4.7 on EMR (with YARN) and write Spark SQL queries to perform transformations.

Estimating the optimal number of shuffle partitions for a complex query:
I am trying to estimate the optimal number of shuffle partitions to set in order to get good performance for a complex query with several joins. From what I found on the internet, the optimal size of a shuffle partition should be in the range of 10 MB to 100 MB. Since I know this target, my next step is to calculate the shuffle volume of the query (in MB) and then divide it by 100 MB to get the shuffle partition number. However, it becomes extremely difficult to estimate the shuffle volume for a complex query involving several joins on big tables.
So how do I estimate the shuffle volume, and consequently the optimal number of shuffle partitions, for the big query? Currently (after much searching) I am performing the following steps:

scala> spark.sql("""
       | create temp view initial_tbl
       | as
       | with first_qry as 
       | (
       | select a.id, 
       | a.fund_id,
       | b.fname, 
       | c.state_nm, 
       | row_number() over (partition by a.id order by c.eff_dt desc) as rownum
       | from tblA a
       | left join tblB b
       | on a.id = b.id
       | left join tblC c
       | on b.st_id = c.st_id
       | )
       | select * from first_qry
       | where rownum = 1
       | """)

scala> spark.sql("""
       | create temp view final_tbl as
       | select a.id, a.fname, a.state_nm, b.salary, c.age
       | from initial_tbl a
       | left join fin_dtls b
       | on a.id = b.id
       | and a.fund_id = b.fund_id
       | left join age_dtls c
       | on b.age_id = c.age_id
       | union all
       | select id, fname, 'NA' as state_nm, salary, age
       | from another_pre_created_tbl
       | """)

scala> spark.sql("""
       | select * from final_tbl 
       | limit 50
       | """)

Note: This is just a simplified version of the actual query.

OK, now I am trying to estimate the data size of this query, which I can then divide by 100 MB to get the optimal shuffle-partition number for the query.

scala> val df = spark.read.table("final_tbl")
scala> println(df.queryExecution.analyzed.stats)
Statistics(sizeInBytes=34.5 GB, hints=none)
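
To avoid working this out by hand each time, I derive the setting from those stats like this (just a rough sketch on my side, assuming the ~100 MB-per-partition target mentioned above):

scala> // rough estimate: total size of the final view divided by a ~100 MB target per partition
scala> val targetBytes = 100L * 1024 * 1024
scala> val sizeInBytes = spark.read.table("final_tbl").queryExecution.analyzed.stats.sizeInBytes   // a BigInt
scala> val shufflePartitions = ((sizeInBytes / targetBytes) + 1).toInt
scala> spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)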

So the size of the above query is 34.5 GB, which when divided by 100 MB gives ~350 shuffle partitions. But even after setting the config SET spark.sql.shuffle.partitions=350, I still see that the query is slow. So my question is:

  • Am I doing this right? If not, please let me know how I can calculate the shuffle volume of a complex query (involving several joins) and finally compute the optimal number of shuffle partitions for any given complex query.

SKEW:
For the query mentioned above, I see that 12 jobs are triggered in the Spark UI. In the UI, the last job shows high skew, i.e. one task is a long bar while the other simultaneous tasks are represented by several very small bars (I wish I could provide a UI screenshot). So, my questions (based on the above) are:

  • How do I identify which part of the above query, or more specifically, which table/column in this big complex query, is the main culprit causing the skew? With so many tables and joins in a big query, it becomes very difficult and time-consuming to test by joining two tables at a time and checking the UI and progress.
    So is there any smart way to find out the actual join table/column causing the skew?
  • Also, after identifying the table/column causing the skew, how can I fix this so that all the partitions have an equal volume of data to process, which will in turn speed up the job?
  • How do I relate a specific job in the UI (that Spark triggered) to a specific part of the query?

Skew while Writing to Output:
Finally, I am writing the output of the above query to S3 via the SQL API (%sql) as:

create table final_out
using parquet
options (
         path 's3://my/test/location/',
         mode 'overwrite'
        )
as
select * from final_tbl
distribute by id;

Even for this, when I check the UI, I find a massive skew like above, with one task as a very long bar and the other simultaneous tasks as very small bars. If you look closely, you will notice that the final query shown above does a union all with another query that has a hard-coded value (i.e. 'NA' as state_nm). Since there are around 100 million records in the union-ed table, the value 'NA' becomes the dominant value for the column state_nm in the output, creating the skew and making the write very slow.
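
For reference, this is the kind of check that shows the dominance of 'NA' (state_nm and final_tbl are from the query above):

scala> // check how the values of state_nm are distributed in the final output
scala> spark.sql("""
       | select state_nm, count(*) as cnt
       | from final_tbl
       | group by state_nm
       | order by cnt desc
       | """).show(false)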

So my final question is:

  • How can I alleviate such skew caused by hard-coded values in the output while writing to disk as parquet files (with the SQL API)? Please note that I have tried to repartition the final output dataset on its PK column (id) to force an even distribution of the records before writing the output, but in vain; note the distribute by id part at the end of the create table statement given above (a DataFrame-API sketch of that attempt is shown just below).
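
For reference, the DataFrame-API equivalent of that repartition attempt would be roughly the following (a sketch only; same S3 path and id column as in the create table statement above):

scala> import org.apache.spark.sql.functions.col
scala> val out = spark.table("final_tbl").repartition(col("id"))   // same intent as "distribute by id"
scala> out.write.mode("overwrite").parquet("s3://my/test/location/")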

My cluster configuration is as follows:

Nodes: 20
Cores:  8
Memory: 64 GB

I am extremely sorry for the long post, but these questions have been bugging me for a long time. I have searched the internet a lot but couldn't find any concrete answers. Can anyone please help me address these questions? Any help is appreciated.

Thanks.

marie20

1 Answer


Can't answer all your questions, but I can share a couple of thoughts, as I've had some of these issues:

How do I identify which part of the above query, or more specifically, which table/column in this big complex query, is the main culprit causing the skew?

You could list all the tables in your queries, do a count on the column that is used for joining them, and see which values represent an outsized portion of the rows. If you want to automate that, use the pandas-profiling or Great Expectations libraries to auto-generate summaries of the columns for each table.
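
For example, something along these lines for each join column (table/column names are taken from the question, so treat this as a sketch only):

scala> // look for join-key values that account for an outsized share of the rows
scala> spark.sql("""
       | select id, count(*) as cnt
       | from tblA
       | group by id
       | order by cnt desc
       | limit 20
       | """).show(false)
scala> // repeat for the other join columns (b.id, c.st_id, b.fund_id, c.age_id, ...)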

Am I doing this right? If not, please let me know how I can calculate the shuffle volume of a complex query (involving several joins) and finally compute the optimal number of shuffle partitions for any given complex query.

I'm not sure there's much more to be done in terms of the shuffle partition setting. The only thing that comes to mind is calculating the size of the biggest table before executing the query and using that to calculate/estimate the shuffle partition number dynamically via spark.conf.set("spark.sql.shuffle.partitions", calculatedNumber), but I'm not convinced that would work out/help.
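
Roughly like this (a sketch only; tblA here just stands in for whatever your biggest input table is):

scala> val bytes = spark.table("tblA").queryExecution.analyzed.stats.sizeInBytes   // size of the biggest input
scala> val calculatedNumber = ((bytes / (100L * 1024 * 1024)) + 1).toInt           // ~100 MB per partition
scala> spark.conf.set("spark.sql.shuffle.partitions", calculatedNumber.toString)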

In my experience, bigger benefits come from caching tables that are used multiple times, broadcasting smaller tables, and partitioning larger DataFrames on the join columns before running the query.
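
For illustration, with the table names from the question (the broadcast hint and CACHE TABLE are standard Spark 2.4 SQL; the rest is just a sketch):

scala> // cache a view that several later queries reuse (CACHE TABLE is eager)
scala> spark.sql("cache table initial_tbl")
scala> // hint Spark to broadcast a small dimension table in the join
scala> spark.sql("""
       | select /*+ BROADCAST(c) */ b.id, b.fname, c.state_nm
       | from tblB b
       | left join tblC c
       | on b.st_id = c.st_id
       | """)
scala> // pre-partition a large table on its join keys before the join
scala> import org.apache.spark.sql.functions.col
scala> spark.table("fin_dtls").repartition(col("id"), col("fund_id")).createOrReplaceTempView("fin_dtls_part")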

As for the writing, I suspect it's not the write itself that makes the process slow, but rather that the whole calculation of the final query is performed at write time (lazy execution), and that is what takes most of the time.

matkurek
  • Thanks for your reply. You mention that I'm "performing the whole calculation of the final query before writing" - could you please elaborate on this? Is there anything else that could be done to optimize what I am doing (calculate -> write) currently? What are my options? If there is any best practice, please share it. BTW, I use the Spark 2.4 **SQL API** only, for writing "chained" SQL queries to transform the data in steps - you can find examples of how I do the transformations through queries in my question above. Please share your thoughts. – marie20 Feb 05 '22 at 15:20