
I have recently been introduced to Spark SQL and am trying to wrap my head around it. I am looking to learn best practices, tips, and tricks for optimizing Spark SQL queries. Most importantly, I wish to learn how to interpret Spark SQL EXPLAIN plans. I have searched online for books/articles on Spark SQL EXPLAIN but ended up with almost nothing.

Can anyone please point me in the right direction?

Due to Spark's architectural differences from a traditional RDBMS, many relational optimization options don't apply to Spark (e.g., leveraging indexes). I could not find many resources devoted exclusively to Spark SQL. I wish to learn the best tips/techniques for writing efficient Spark SQL queries (e.g., usage of hints, or the order of tables in join clauses, i.e., keeping the largest table at the end of the join conditions).
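For illustration, this is the kind of hint usage I mean; the table and column names below are made up:

```sql
-- BROADCAST asks Spark to ship the small dimension table to every executor
-- instead of shuffling both sides of the join (hypothetical tables).
SELECT /*+ BROADCAST(d) */
       f.order_id,
       d.customer_name
FROM   fact_orders f
JOIN   dim_customers d
  ON   f.customer_id = d.customer_id;
```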

Most importantly, any resources on understanding and leveraging Spark SQL EXPLAIN plans would be great. However, please note that I have access only to Spark SQL, not PySpark.
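To be concrete, the only thing I know to do so far is run EXPLAIN on a query and stare at the output (same made-up tables as above):

```sql
-- EXPLAIN prints the physical plan; EXPLAIN EXTENDED also shows the parsed,
-- analyzed, and optimized logical plans. (Spark 3.0+ adds EXPLAIN FORMATTED.)
EXPLAIN EXTENDED
SELECT f.order_id, d.customer_name
FROM   fact_orders f
JOIN   dim_customers d
  ON   f.customer_id = d.customer_id;
```

What I'm after is material explaining how to actually read the resulting plan (scans, exchanges, join strategies, etc.).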

Any help is appreciated.

Thanks

marie20
  • Hey. You might find this talk useful: https://youtu.be/99fYi2mopbs – facha May 31 '20 at 20:31
  • In the first instance, why don't you try reordering tables and see if it makes a difference? This (the first result when I googled "spark does table order matter": https://stackoverflow.com/questions/28694523/in-spark-join-does-table-order-matter-like-in-pig) says no. My understanding of Spark is that performance has a lot to do with getting maximum parallelism through correct partitioning of data. – Nick.Mc May 31 '20 at 22:33
  • This book could answer all your queries: https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ – alexgids Jun 01 '20 at 04:01

0 Answers