I have some PySpark code with a very large number of joins and aggregations. I've enabled the Spark UI and have been digging into the event timeline, job stages, and DAG visualization. I can find the task ID and executor ID for the expensive parts. Does anyone have a tip on how I can tie the expensive parts from the Spark UI output (task ID, executor ID) back to parts of my PySpark code? I can tell from the output that the expensive parts are caused by a large number of shuffle operations from all my joins, but it would be really handy to identify which join was the main culprit.
1 Answer
Your best approach is to start applying actions to your dataframes at various points in the code: pick a place, write the dataframe to a file, read it back, and continue. This lets you isolate your bottlenecks, and you can also observe each smaller portion of the execution separately in the UI.
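A minimal sketch of that approach is below. The input dataframes, join keys, and the /tmp/debug paths are hypothetical placeholders for your own pipeline; the point is that each write forces execution of everything before it, so the jobs on either side of the write show up as separate entries in the Spark UI and the shuffle cost can be attributed to a specific join.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-debugging").getOrCreate()

# Hypothetical inputs -- substitute your own sources.
orders = spark.read.parquet("/data/orders")
customers = spark.read.parquet("/data/customers")
products = spark.read.parquet("/data/products")

# First segment: one join, then force execution by writing to disk.
stage1 = orders.join(customers, "customer_id")
stage1.write.mode("overwrite").parquet("/tmp/debug/stage1")

# Read the materialized result back and continue with the next join.
# Everything triggered after this point runs as new jobs in the Spark UI,
# so you can compare the cost of each segment independently.
stage1 = spark.read.parquet("/tmp/debug/stage1")
stage2 = stage1.join(products, "product_id")
stage2.write.mode("overwrite").parquet("/tmp/debug/stage2")
```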

Vitaliy
Thanks for the tip. Do you know what I would see on the event timeline, completed stages, or DAG visualization that would tip me off that it was the "write" step? And if I add more than one, is there a way to distinguish them in the Spark UI? – user3476463 Jun 15 '21 at 17:47