I work with Spark SQL v2.4.7 on EMR (with YARN) and write Spark SQL queries to perform transformations.
Estimating the optimal number of shuffle partitions for a complex query:
I am trying to estimate the optimal number of shuffle partitions to set in order to get good performance for a complex query with several joins. From what I have found online, the optimal size of a shuffle partition should be in the range of 10 MB to 100 MB. Since I know this target value, my next step is to calculate the shuffle volume of the query (in MB) and then divide it by 100 MB to get the number of shuffle partitions. However, it becomes extremely difficult to estimate the shuffle volume for a complex query involving several joins over big tables.
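(As a rough worked example of that rule of thumb, with made-up numbers: if a query shuffled about 5 GB in total, then 5 * 1024 MB / 100 MB ≈ 51, so I would set roughly 50 shuffle partitions.)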
So how do I estimate the shuffle volume, and consequently the optimal number of shuffle partitions, required for a big query? Currently (after much searching) I am performing the following steps:
scala> spark.sql("""
| create temp view initial_tbl
| as
| with first_qry as
| (
| select a.id,
| a.fund_id,
| b.fname,
| c.state_nm,
| row_number() over (partition by a.id order by c.eff_dt desc) as rownum
| from tblA a
| left join tblB b
| on a.id = b.id
| left join tblC c
| on b.st_id = c.st_id
| )
| select * from first_qry
| where rownum = 1
| """)
scala> spark.sql("""
| create temp view final_tbl as
| select a.id, a.fname, a.state_nm, b.salary, c.age
| from initial_tbl a
| left join fin_dtls b
| on a.id = b.id
| and a.fund_id = b.fund_id
| left join age_dtls c
| on b.age_id = c.age_id
| union all
| select id, fname, 'NA' as state_nm, salary, age
| from another_pre_created_tbl
| """)
scala> spark.sql("""
| select * from final_tbl
| limit 50
| """)
Note: This is just a simplified version of the actual query.
OK, now I am trying to estimate the data size of this query, which I can then divide by 100 MB to get the optimal number of shuffle partitions for the query.
scala> val df = spark.read.table("final_tbl")
scala> println(df.queryExecution.analyzed.stats)
Statistics(sizeInBytes=34.5 GB, hints=none)
So the size of the above query is 34.5 GB, which, when divided by 100 MB, gives ~350 shuffle partitions.
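For reference, this is roughly how I turn that estimate into the setting in the shell (a minimal sketch; the 100 MB divisor is just the rule-of-thumb target from above):
scala> val sizeInBytes = spark.read.table("final_tbl").queryExecution.analyzed.stats.sizeInBytes // same BigInt estimate as printed above
scala> val targetBytes = BigInt(100L * 1024 * 1024) // 100 MB rule-of-thumb target per partition
scala> val numPartitions = ((sizeInBytes + targetBytes - 1) / targetBytes).toInt // round up
scala> spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)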
Now, even after setting the config SET spark.sql.shuffle.partitions=350, I still see that the query is slow. So my question is:
- Am I doing this right? If not, please let me know how I can calculate the shuffle volume of a complex query (involving several joins) and, from that, compute the optimal number of shuffle partitions for any given complex query.
SKEW:
For the query mentioned above, I see that 12 jobs are triggered in the Spark UI. The last job shows high skew, i.e. one task is a very long bar while the other concurrent tasks show up as several very small bars (I wish I could provide the UI screenshot).
So, my questions (based on the above) are:
- How do I identify which part of the above query, or more specifically which table/column in this big complex query, is the main culprit causing the skew? With so many tables and joins in a big query, it becomes very difficult and time-consuming to test by joining two tables at a time and checking the UI and progress (a sketch of the kind of manual per-key check I run today follows these questions). So is there any smart way to find out the actual join table/column causing the skew?
- Also, after identifying the table/column causing the skew, how can I fix it so that all the partitions have an equal volume of data to process, which will in turn speed up the job?
- How do I relate a specific job in the UI (that Spark triggered) to a specific part of the query?
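For context, the manual per-key check mentioned above looks roughly like this for one table/key at a time (a sketch using the same tblA/id names from the simplified query):
scala> spark.sql("""
| select id, count(*) as cnt
| from tblA
| group by id
| order by cnt desc
| limit 20
| """).show()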
Skew while Writing to Output:
Finally, I am writing the output of the above query to S3 using the SQL API (%sql) as:
create table final_out
using parquet
options (
path 's3://my/test/location/',
mode 'overwrite'
)
as
select * from final_tbl
distribute by id;
Even for this, when I check the UI I find a massive skew like the one above, with one task as a very long bar and the other concurrent tasks as very small bars. If you look closely, you will notice that the final query shown above does a union all with another query that has a hard-coded value (i.e. 'NA' as state_nm). Now, since there are around 100 million records in the union-ed table, the value 'NA' becomes the dominant value for the column state_nm in the output, which creates the skew and makes the write very slow.
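For what it's worth, a quick distribution check like the following (a sketch against the final_tbl view defined above) is how the 'NA' dominance shows up:
scala> spark.sql("""
| select state_nm, count(*) as cnt
| from final_tbl
| group by state_nm
| order by cnt desc
| """).show()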
So my final question is:
- How can I alleviate such skew caused by hard-coded values in the output while writing to disk as parquet files (with the SQL API)? Please note that I have tried to repartition the final output dataset on its PK column (id) to force an even distribution of the records before writing the output, but in vain - note the distribute by id part at the end of the create table statement given above.
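For completeness, the DataFrame-API equivalent of that repartition attempt would look roughly like this (a sketch only; same view name and S3 placeholder path as above):
scala> spark.read.table("final_tbl").repartition($"id").write.mode("overwrite").parquet("s3://my/test/location/") // same effect as distribute by id + overwrite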
My cluster configuration is as follows:
Nodes: 20
Cores: 8
Memory: 64 GB
I am extremely sorry for the long post, but these questions have been bugging me for a long time. I have searched the internet a lot but couldn't find any concrete answers. Can anyone please help me address these questions? Any help is appreciated.
Thanks.