
I'm trying to optimize a Spark SQL program that is basically one HUGE SQL query (it joins around 10 tables, with many CASE expressions and so on). I'm more used to DataFrame-API-oriented programs, and those showed the different stages much better.

The query is quite well structured and I understand it more or less. I normally use the SQL view of the Spark UI to get hints on where to focus the optimizations.

However, for this kind of program the Spark UI SQL view shows almost nothing. Is there a reason for this, or a way to force it to show more?

I expected to see each join/scan with the number of output rows after it and so on, but I only see a single "WholeStageCodegen" node for a "Parsed Logical Plan" that is about 800 lines long.

I can't show the code, but it has the following characteristics:

1- The action triggering it is "show"(20)
2- It takes about 1 hour to execute (with only a few executors so far)
3- It has a persist before the show/action
4- It uses Kudu, Hive and in-memory tables (registered before this query)
5- Its logical plan is about 700 lines long

Is there a way to improve the tracing there? (Maybe by disabling WholeStageCodegen? But that may hurt performance...)
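
For reference, a minimal PySpark-style sketch of what "disabling WholeStageCodegen" could look like for a diagnostic run. `spark.sql.codegen.wholeStage` is a real Spark SQL setting, but the helper name `explain_without_codegen` and the `query` argument are hypothetical, and I'm assuming `spark` is an existing SparkSession. Whether this actually changes what the SQL view shows in my case is exactly what I'm unsure about.

```python
# Diagnostic-only settings (assumption: a Spark 2.x+ session).
DEBUG_CONF = {
    # Turn off whole-stage code generation so operators are planned individually.
    "spark.sql.codegen.wholeStage": "false",
}


def explain_without_codegen(spark, query):
    """Apply the diagnostic settings, then print all plans for `query`.

    `spark` is assumed to be a SparkSession; `query` is the SQL text.
    Returns the (lazy) DataFrame so an action can be run on it afterwards.
    """
    for key, value in DEBUG_CONF.items():
        spark.conf.set(key, value)
    df = spark.sql(query)
    # explain(True) prints the parsed, analyzed, optimized and physical plans.
    df.explain(True)
    return df
```

Since this changes a session-level setting, it would make sense to use it only in a separate diagnostic run and not in the real job.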

This is what I see

Something like this is what I expected to see (with a much more complex plan, of course):

Thanks!

BiS
  • I think you're looking downstream from the persist, try removing the persist – Manu Valdés Mar 15 '19 at 20:02
  • Could be. Isn't there a way to prevent persists and caches from breaking the UI? The steps before the persist don't seem to appear. I've tried removing the show, though – BiS Mar 15 '19 at 20:15
  • I've changed a few things and yeah, I added a few phases and it starts from the persist. Will try without it. Thanks, Manu! – BiS Mar 15 '19 at 20:56
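
The experiment suggested in the comments above (running the same query with and without the persist) could be sketched roughly like this. `run_query` and its parameters are hypothetical names, and `spark` is assumed to be an existing SparkSession; it is only a way to compare what the SQL view shows in each case, not a fix.

```python
def run_query(spark, query, persist=False, n=20):
    """Build the DataFrame for `query` and trigger it with show(n).

    With persist=False the action runs directly on the query's plan;
    with persist=True the DataFrame is marked for caching first, as in
    the original job.
    """
    df = spark.sql(query)
    if persist:
        # persist() is lazy: the cache is only filled by the action below.
        df = df.persist()
    df.show(n)
    return df
```

Calling it once with `persist=False` and once with `persist=True` should make it easier to see whether the persist is what hides the upstream steps. If an equivalent plan is already cached elsewhere in the session, Spark may still read from that cache.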

0 Answers