
I need to optimize my PySpark code so that the execution plan is as parallel as possible. Is there a better way than the .explain method (whose output is hard to read) to explore the DAG, for example as a "normal" graph object?

For example, it would be very useful to know the total number of stages, the number of "first-level nodes" of the DAG, and so on. Thanks.

DPColombotto

1 Answer


You can get a more detailed explain plan from the Catalyst optimizer by passing True to explain(); perhaps this is what you are looking for:

df = spark.range(10)
df.explain(True)
# output:
== Parsed Logical Plan ==
Range (0, 10, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint
Range (0, 10, step=1, splits=Some(8))

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8))

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=8)
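
If you want the plan as a string you can inspect programmatically (rather than just printed to the console), one option is to go through the JVM QueryExecution object behind the DataFrame. A minimal sketch follows; note this is an internal API that may change between Spark versions, and counting "Exchange" operators is only a rough proxy for shuffle (stage) boundaries:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

qe = df._jdf.queryExecution()       # internal JVM QueryExecution object behind the DataFrame
plan_text = qe.toString()           # parsed, analyzed, optimized and physical plans as one string
print(plan_text.count("Exchange"))  # rough proxy for the number of shuffle boundaries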

For more detail you can also use the Spark UI, which provides a DAG visualization and a breakdown of jobs, stages, tasks, cached objects, executor distribution, and environment variables. You can access it at http://driver_node_host:4040 (4040 is the default port). See the docs for additional configuration: https://spark.apache.org/docs/latest/configuration.html#spark-ui
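
If you need job/stage counts programmatically rather than through the UI, PySpark's StatusTracker exposes some of the same information. A minimal sketch, assuming the job group name "stage-count-demo" and the example query (both are placeholders, not from the original answer); StatusTracker only reports jobs and stages after an action has actually run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
tracker = sc.statusTracker()

# tag the next action with a job group so its jobs can be looked up afterwards
sc.setJobGroup("stage-count-demo", "inspect stages for one action")
df = spark.range(10).repartition(4)
df.count()  # action: actually runs the job(s)

for job_id in tracker.getJobIdsForGroup("stage-count-demo"):
    job = tracker.getJobInfo(job_id)
    if job is None:
        continue
    print("job", job_id, "->", len(job.stageIds), "stage(s)")
    for stage_id in job.stageIds:
        stage = tracker.getStageInfo(stage_id)
        if stage is not None:
            print("  stage", stage_id, ":", stage.name, "-", stage.numTasks, "task(s)")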

thePurplePython