
PySpark uses cProfile and works according to the docs for the RDD API, but there seems to be no way to get the profiler to print results after running a bunch of DataFrame API operations:

from pyspark import SparkConf, SparkContext, SQLContext

# Python profiling must be enabled before the SparkContext is created
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('a', 0), ('b', 1)])
df = sqlContext.createDataFrame(rdd)
rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out
sc.show_profiles()  # here prints nothing (no new profiling to show)

rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out again

df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!

# and again it works when converting to RDD, but not on the DataFrame itself

df.rdd.count()      # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out

df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!
Jason
  • I tried `df.groupby('_1').count().collect()`, which apparently has both actions and transformations, and still no printout – Jason Jan 30 '19 at 23:05

1 Answer


That is the expected behavior.

Unlike the RDD API, which executes native Python logic, the DataFrame / SQL API is JVM native. Unless you invoke a Python udf* (including pandas_udf), no Python code is executed on the worker machines. All that happens on the Python side is simple API calls through the Py4j gateway.

Therefore no profiling information exists.
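
You can see this directly from the physical plan. A minimal sketch, reusing the df from the question:

# The plan printed here consists entirely of JVM-side operators; no Python
# stage appears, so there is nothing for the Python profiler to record.
df.groupBy('_1').count().explain()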


* Note that udfs seem to be excluded from the profiling as well.
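
For completeness, a minimal sketch of forcing Python execution on the workers with a udf (the _2 column name comes from the question's tuple data); even then, per the note above, it may not show up in show_profiles():

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

add_one = udf(lambda x: x + 1, LongType())  # runs in Python worker processes
df.select(add_one(df._2)).count()
sc.show_profiles()  # per the note above, this may still print nothing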

user10938362