
I've got a complex piece of software which performs really complex SQL queries (well, not queries exactly, but Spark plans). The plans are dynamic: they change based on user input, so I can't "cache" them.

I've got a phase in which Spark takes 1.5-2 minutes building the plan. To make sure, I added a "logXXX" marker, then explain(true), then a "logYYY" marker, and the explain alone takes 1 minute 20 seconds to execute.
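For reference, the measurement looks roughly like this (a sketch; df and the println markers stand in for my real DataFrame and log calls):

    import org.apache.spark.sql.DataFrame

    // Sketch of how I timed the planning phase; df stands in for my
    // dynamically built DataFrame.
    def timeExplain(df: DataFrame): Unit = {
      println("logXXX")                      // marker before planning
      val t0 = System.nanoTime()
      df.explain(true)                       // forces analysis + optimization of the plan
      val seconds = (System.nanoTime() - t0) / 1e9
      println(f"logYYY: explain took $seconds%.1f s")  // ~80 s in my case
    }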

I've tried breaking the lineage, but this seems to cause worse performance overall because the actual execution time becomes longer.

I can't parallelize the driver work any further (I already did, but this task can't be overlapped with anything else).

Any ideas or guidance on how to speed up the plan builder in Spark? (For example, flags to try enabling/disabling and such...)

Is there a way to cache plans in Spark? (So I could build the plan in parallel and then execute it.)

I've tried disabling all possible optimizer rules and setting spark.sql.optimizer.maxIterations to 30... but nothing seems to affect that concrete point :S

I tried disabling whole-stage codegen and it helped a little, but the execution got longer, so that's no good :).
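For context, these are the kinds of knobs I've been flipping (a sketch assuming a SparkSession named spark; the excluded rule shown is just the CombineUnions example from the comments below):

    // Exclude specific (non-mandatory) optimizer rules by fully qualified name
    spark.conf.set("spark.sql.optimizer.excludedRules",
      "org.apache.spark.sql.catalyst.optimizer.CombineUnions")

    // Cap the optimizer's fixed-point iterations (internal conf)
    spark.conf.set("spark.sql.optimizer.maxIterations", "30")

    // Disable whole-stage codegen (helped planning a bit, hurt execution)
    spark.conf.set("spark.sql.codegen.wholeStage", "false")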

Thanks!

PS: The plan does contain multiple unions (fewer than 20, but with quite complex plans inside each union branch), which are the cause of the time, but splitting them apart also hurts execution time.

BiS
  • Just found https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala, but more things are welcome :) – BiS Sep 01 '19 at 20:18
  • Tried setting "spark.sql.optimizer.maxIterations" to 30, no effect :(. Also tried disabling the CombineUnions rule – BiS Sep 01 '19 at 21:27
  • Tried disabling ALL optimizer rules and it still takes 1:30 :/. I'll have to enable more logging :S – BiS Sep 01 '19 at 21:42
  • How did you disable all the rules? A couple of things that could contribute to the slowdown: 1) metastore calls to fetch table information; 2) if the persistent store is cloud-based (S3 etc.), listing files requires API calls, which is an expensive operation – DaRkMaN Sep 02 '19 at 02:37
  • I disabled all the rules which are available to disable (some are mandatory) by using spark.sql.optimizer.excludedRules. The storage is only Hive and Kudu (mostly Kudu). Metastore calls could be a reason, yes, but I've got many other phases where I use mostly the same tables without a problem. Also, the time is spent AFTER I call "count"; shouldn't the metastore work be done before that (so Spark knows the columns it's working with when I'm creating the DataFrame)? – BiS Sep 02 '19 at 06:16
  • For now I'm going to try to overlap some tasks which can be stripped out of the big plan (and which scan big amounts of data) so they run while Spark calculates the big plan :/ – BiS Sep 02 '19 at 06:57

1 Answer


Just in case it helps someone (and unless someone provides more insight):

As I couldn't manage to reduce the optimizer time (and I'm not sure reducing it would even be good, as I might lose execution time in exchange), I restructured the job instead.

One of the last parts of my plan was scanning two big tables and getting one column from each of them (using windows, aggregations, etc...).

So I split my code into two parts:

1- The big plan (cached)
2- The small plan which scans and aggregates two big tables (cached)

And added one more part:

3- Left join/enrich the big plan with the output of "2" (this takes about 10 seconds; the dataset is not that big) and finish the remaining computation.

Now I launch both actions (1 and 2) in parallel (using driver-level parallelism/threads), cache the resulting DataFrames, wait for both to finish, and afterwards perform 3.
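A minimal sketch of the pattern (buildBigPlan, buildSmallPlan and the join key are illustrative stand-ins for my real code; a SparkSession is safe to use from multiple driver threads):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.sql.DataFrame

    // Force planning + execution and leave the result cached.
    def materialize(df: DataFrame): DataFrame = {
      val cached = df.cache()
      cached.count() // action: pays the optimizer cost and runs the job
      cached
    }

    // 1 and 2 run concurrently: while the driver thread for the big plan
    // spends ~1:30 in the optimizer, the executors already run the small plan.
    val bigF   = Future(materialize(buildBigPlan()))    // heavy plan, heavy exec
    val smallF = Future(materialize(buildSmallPlan()))  // tiny plan, big scans

    val big   = Await.result(bigF, Duration.Inf)
    val small = Await.result(smallF, Duration.Inf)

    // 3: enrich the big result with the small one (~10-15 s) and continue.
    val result = big.join(small, Seq("key"), "left")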

With this, while the Spark driver (thread 1) is calculating the big plan (~2 minutes), the executors are already executing part "2" (which has a small plan, but big scans/shuffles), and then both get "mixed" in about 10-15 seconds. That's a good improvement in execution time, since the 1:30 of plan calculation now overlaps with real work.

Comparing times:

Before I would have

1:30 Spark optimizing time + 6 minutes execution time

Now I have

max(
  1:30 Spark optimizing time + 4 minutes execution time,
  0:02 Spark optimizing time + 2 minutes execution time
)
+ 15 seconds joining both parts

Not a huge saving, but quite a few "expensive" people will be waiting for it to finish :)

BiS
  • Is this about influencing, or adding your own rules to, the Catalyst Optimizer? – thebluephantom Sep 02 '19 at 13:25
  • No, it's about modifying user (i.e. my) code in order to do it. Adding more rules wouldn't solve anything (especially since I don't know the optimizer well enough to know which rule is being slow and how to optimize it with a custom rule) – BiS Sep 02 '19 at 15:22
  • A custom rule amounts to the same thing in my book – thebluephantom Sep 02 '19 at 15:31
  • Not sure; I don't know custom rules very well, but if your action takes 2 minutes to get started just because it's being optimized, can you really start the plan partially by using a custom rule? I understand custom rules as working similarly to a compiler: you can improve the output of the plan, but I want part of the plan's execution to be short-circuited/overlapped with the actual calculation of the plan. – BiS Sep 02 '19 at 15:33
  • I've read a little more about the rules (especially about custom strategies), and although this is something which could help with optimizing, it requires very deep knowledge of Catalyst if you want to optimize something as big as my problem (plus I don't know where the time is actually "lost"). I don't see a way to perform parallel computation while optimizing there (you can probably parallelize the execution, but before getting the plan?), or at least nothing better than my current approach in terms of effort/benefit :S. – BiS Sep 02 '19 at 16:12