Performance Improvement in scala dataframe operations

Question

I am using a table which is partitioned by load_date column and is weekly optimized with delta optimize command as source dataset for my use case.

The table schema is as shown below:

+-----------------+--------------------+------------+---------+--------+---------------+
|               ID|          readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+

Later this table will be pivoted on columns item_txt and item_value_txt and many operations are applied using multiple window functions as shown below:

val windowSpec = Window.partitionBy("id","readout_date")
val windowSpec1 = Window.partitionBy("id","readout_date").orderBy(col("readout_id") desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow-1)

These window functions are used to achieve multiple logic on the data. Even there are few joins used to process the data.

The final table is partitioned with readout_date and id and could see the performance is very poor as it take much time for 100 ids and 100 readout_date

If I am not partitioning the final table I am getting the below error.

Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

The expected count of id in production is billions and I expect much more throttling and performance issues while processing with complete data.

Below provided the cluster configuration and utilization metrics.

Please let me know if anything is wrong while doing repartitioning, any methods to improve cluster utilization, to improve performance...

Any leads Appreciated!

Are you using 16 cores per executor? – newzad Apr 19 '22 at 09:20 — newzad, Apr 19 '22 at 09:20
@newzad : yes it is 16 per executor – Antony Apr 19 '22 at 15:46 — Antony, Apr 19 '22 at 15:46
Did you find an answer? – Matt Andruff May 02 '22 at 14:38 — Matt Andruff, May 02 '22 at 14:38

score 0 · Answer 1 · answered Apr 19 '22 at 12:40

0

spark.driver.maxResultSize is just a setting you can increase it. BUT it's set at 4Gigs to warn you you are doing bad things and you should optimize your work. You are doing the correct thing asking for help to optimize.

The first thing I suggest if you care about performance get rid of the windows. The first 3 windows you use could be achieved using Groupby and this will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. And you might think that would perform worse but i'm here to tell you if you can avoid using a window you will get better performance almost every time. Windows aren't bad things, they are a tool to be used but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine?) Or can you use aggregate functions to complete the work without having to use a window? You should explore your options.

answered Apr 19 '22 at 12:40

Matt Andruff

4,974
1
5
21

You should look at the spark UI and look at the tasks/jobs to see where you are 'losing'/'spending' time. It will help you identify what should be looked at first to optimize. – Matt Andruff Apr 19 '22 at 12:46
Below are the few activities inside the logic, which are using window functions and joins. Please have a look at those and If you could find some alternate methods to achieve the requirement, it will be helpful. – Antony Apr 19 '22 at 15:48
https://stackoverflow.com/questions/71293025/division-operation-of-column-values-in-spark-dataframe-based-on-conditions – Antony Apr 19 '22 at 15:49
https://stackoverflow.com/questions/71251321/find-number-of-null-records-between-two-non-null-records-in-scala-dataframe – Antony Apr 19 '22 at 15:50
https://stackoverflow.com/questions/71650109/unpivot-columns-into-multiple-columns-and-values-in-scala-dataframe – Antony Apr 19 '22 at 15:50
I would actually probably just do an average by month and in 90% of your data that would give you a reasonable answer and require minimal effort. If you have more interest in investigating the phenomenon, you could partition the data by 'batch' and then use group by on it after that to get more information. This might be preferable but resorts to using at least 1 window. But after that you'd have be able to partition effectively to study the issue. – Matt Andruff Apr 19 '22 at 17:38

score 0 · Answer 2 · answered Apr 19 '22 at 19:40

Given your other answers, you should be grouping by ID not windowing by Id. And likely using aggregates(sum) by week of year/month. This would likely give you really speedy performance with the loss of some granularity. This would give you enough insight to decide to look into something deeper... or not.

If you wanted more accuracy, I'd suggest using: Converting your null's to 0's.

val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date") asc) // asc is important as it flips the relationship so that it groups the previous nulls

Then create a running total on the SIG_XX VAL or whatever signal you want to look into. Call the new column 'null-partitions'.

This will effectively allow you to group the numbers(by null-partitions) and you can then run aggregate functions using group by to complete your calculations. Window and group by can do the same thing, windows just more expensive in how it moves data, slowing things down. Group by uses a more of the cluster to do the work and speeds up the process.

Performance Improvement in scala dataframe operations

2 Answers2