I work on an ETL development team where we use Spark SQL to transform data by building a sequence of intermediate temporary views, ending with a final temp view whose data is then copied into the target table's folder.
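To illustrate, our pipelines have roughly this shape (simplified; the table, column, and path names below are made up):

```scala
// Simplified shape of the pipeline; names and paths are hypothetical.
val src = spark.read.parquet("/data/source")
src.createOrReplaceTempView("stage1")

spark.sql("SELECT id, amount * 1.1 AS adj_amount FROM stage1")
  .createOrReplaceTempView("stage2")

spark.sql("SELECT id, SUM(adj_amount) AS total FROM stage2 GROUP BY id")
  .write.mode("overwrite").parquet("/data/target")
```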
However, our queries sometimes take an excessive amount of time even with a small number of records (fewer than ~10K), and we end up scrambling for help in every direction.

Hence I would like to learn about Spark SQL performance tuning in detail (e.g. what happens behind the scenes, the architecture, and most importantly, how to interpret explain plans), so that I can build a solid foundation on the subject. I have past experience with performance tuning on RDBMSs (Teradata, Oracle, etc.).

Since I am very new to this, can anyone please point me in the right direction to books, tutorials, courses, etc. on this subject? I have searched the internet and several online learning platforms but couldn't find a comprehensive tutorial or resource for it.

Please help! Thanks in advance.

1 Answer

I won't go into every detail, as the topic is very broad, but there are a few concepts you should understand while tuning your jobs.

  1. Number of Executors
  2. Number of Executor Cores
  3. Executor Memory

These three settings directly determine the degree of parallelism your application can achieve.
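For instance, they can be set when you build the session (a minimal sketch; the values are placeholders, not recommendations, and should be tuned for your cluster):

```scala
import org.apache.spark.sql.SparkSession

// Resource knobs controlling parallelism; the values are placeholders.
val spark = SparkSession.builder()
  .appName("etl-job")
  .config("spark.executor.instances", "4") // number of executors
  .config("spark.executor.cores", "4")     // cores per executor
  .config("spark.executor.memory", "8g")   // heap per executor
  .getOrCreate()
```

The same settings can also be passed to spark-submit as --num-executors, --executor-cores, and --executor-memory.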

  1. Shuffling
  2. Spilling
  3. Partitioning
  4. Bucketing

The above matter for how your data is stored, formatted, and moved between stages.
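For a data set as small as yours, two cheap experiments are lowering the shuffle partition count and bucketing a frequently joined key. A sketch, with hypothetical table, column, and bucket/partition numbers:

```scala
// Default spark.sql.shuffle.partitions is 200, which is often far too
// many for ~10K rows; each tiny partition adds scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", "64")

val orders = spark.table("staging.orders") // hypothetical source table

// Repartition on the join/aggregation key to control parallelism explicitly.
val byCustomer = orders.repartition(64, orders("customer_id"))

// Bucketing pre-shuffles and pre-sorts data on disk, so later joins or
// aggregations on the bucketed key can skip the shuffle entirely.
byCustomer.write
  .bucketBy(16, "customer_id")
  .sortBy("customer_id")
  .mode("overwrite")
  .saveAsTable("warehouse.orders_bucketed")
```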

P.S.: It's just the tip of the iceberg! Good luck.

I am attaching a few links about scaling Spark jobs; they could be a nice starting point.

Scaling Spark Jobs At Facebook

Joins and Shuffling
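For one concrete example of the join/shuffle theme: when one side of a join is small, broadcasting it lets Spark use a broadcast hash join instead of a shuffle-based sort-merge join. A minimal sketch (the view names are hypothetical):

```scala
import org.apache.spark.sql.functions.broadcast

val facts = spark.table("tmp_fact_view") // hypothetical large view
val dims  = spark.table("tmp_dim_view")  // hypothetical small view

// Broadcasting the small side avoids shuffling the large side.
val joined = facts.join(broadcast(dims), Seq("dim_id"))

// Print the physical plan: a BroadcastHashJoin node (rather than a
// SortMergeJoin preceded by Exchange, i.e. shuffle, nodes) confirms it.
joined.explain()
```

Since you asked about explain plans: the explain() output is exactly where to start reading them; Exchange nodes mark shuffles.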
