I have some PySpark code with a very large number of joins and aggregations. I've enabled the Spark UI and have been digging into the event timeline, job stages, and DAG visualization. I can find the task ID and executor ID for the expensive parts. Does anyone have a tip on how I can tie the expensive parts from the Spark UI output (task ID, executor ID) back to parts of my PySpark code? I can tell from the output that the expensive parts are caused by a large number of shuffle operations from all my joins, but it would be really handy to identify which join was the main culprit.
1 Answer
Your best approach is to start applying actions to your dataframes at various points in the code: pick a place, write the dataframe to a file, read it back, and continue. This lets you isolate your bottlenecks, and you can also observe each smaller portion of the execution separately in the UI.
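A minimal sketch of that approach is below. The input dataframes, join keys, and the /tmp/debug paths are hypothetical placeholders for your own pipeline; the point is that each write forces execution of everything before it, so the jobs on either side of the write show up as separate entries in the Spark UI and the shuffle cost can be attributed to a specific join.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-debugging").getOrCreate()

# Hypothetical inputs -- substitute your own sources.
orders = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")
products = spark.read.parquet("/data/products")

# First segment: one join, then force execution by writing to disk.
stage1 = orders.join(customers, "customer_id")
stage1.write.mode("overwrite").parquet("/tmp/debug/stage1")

# Read the materialized result back and continue with the next join.
# Everything triggered after this point runs as new jobs in the Spark UI,
# so you can compare the cost of each segment independently.
stage1 = spark.read.parquet("/tmp/debug/stage1")
stage2 = stage1.join(products, "product_id")
stage2.write.mode("overwrite").parquet("/tmp/debug/stage2")
```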

Vitaliy
Thanks for the tip. Do you know what I would see on the event timeline, completed stages, or DAG visualization that would tip me off that it was the "write" step? And if I add more than one, is there a way to distinguish them in the Spark UI? – user3476463 Jun 15 '21 at 17:47