I have recently been introduced to Spark-SQL and am trying to wrap my head around it. I am looking to learn best practices, tips, and tricks for optimizing Spark-SQL queries.
Most importantly, I wish to learn how to interpret Spark SQL EXPLAIN plans. I have searched online for books/articles on Spark SQL EXPLAIN but ended up with almost nothing. Can anyone please orient me in the right direction?
Due to Spark's architectural differences from a traditional RDBMS, many relational optimization options don't apply to Spark (e.g., leveraging indexes).
I could not find many resources related exclusively to Spark-SQL. I wish to learn the best tips/techniques for writing efficient Spark-SQL queries (e.g., usage of hints, or ordering of tables in join clauses, such as keeping the largest table at the end of the join conditions).
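For example, one technique I have come across is the broadcast join hint, though I am not sure when it helps or how it shows up in the plan. A small sketch of what I mean (the table names are made up for illustration):

```sql
-- Hypothetical tables: 'sales' is large, 'regions' is small.
-- The BROADCAST hint asks Spark to ship the small table to every
-- executor, avoiding a shuffle of the large one.
SELECT /*+ BROADCAST(r) */ s.order_id, r.region_name
FROM sales s
JOIN regions r
  ON s.region_id = r.region_id;
```

It is patterns like this, and how to verify them, that I would like to learn systematically.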
Most importantly, any resources on understanding and leveraging Spark-SQL EXPLAIN plans would be great.
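To be concrete, I can generate plans like the following (again, the table name is just for illustration), but I don't know how to read the output methodically:

```sql
-- EXPLAIN prints the physical plan; EXPLAIN EXTENDED also shows the
-- parsed, analyzed, and optimized logical plans.
EXPLAIN EXTENDED
SELECT region_id, COUNT(*) AS order_cnt
FROM sales
GROUP BY region_id;
```

What I am missing is a guide to interpreting the operators in that output (scans, exchanges, aggregates, join strategies) and spotting problems from them.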
However, please note that I have access only to Spark-SQL, not PySpark.
Any help is appreciated.
Thanks