
PySpark uses cProfile and works according to the docs for the RDD API, but there seems to be no way to get the profiler to print results after running a bunch of DataFrame API operations:

from pyspark import SparkConf, SparkContext, SQLContext

# Python profiling must be enabled before the SparkContext is created
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('a', 0), ('b', 1)])
df = sqlContext.createDataFrame(rdd)
rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out
sc.show_profiles()  # here prints nothing (no new profiling to show)

rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out again

df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!

# and again it works when converting to RDD, but not on the DataFrame itself

df.rdd.count()      # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out

df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!
Jason
  • I tried `df.groupby('_1').count().collect()`, which apparently has both actions and transformations, and still no printout – Jason Jan 30 '19 at 23:05

1 Answer


That is the expected behavior.

Unlike the RDD API, which executes native Python logic, the DataFrame / SQL API is JVM native. Unless you invoke a Python udf* (including pandas_udf), no Python code is executed on the worker machines. All that happens on the Python side is simple API calls through the Py4j gateway.

Therefore no profiling information exists.
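
You can see this directly from the physical plan. A minimal sketch, reusing the df from the question:

# The plan printed here consists entirely of JVM-side operators; no Python
# stage appears, so there is nothing for the Python profiler to record.
df.groupBy('_1').count().explain()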


* Note that udfs seem to be excluded from the profiling as well.
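
For completeness, a minimal sketch of forcing Python execution on the workers with a udf (the _2 column name comes from the question's tuple data); even then, per the note above, it may not show up in show_profiles():

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

add_one = udf(lambda x: x + 1, LongType())  # runs in Python worker processes
df.select(add_one(df._2)).count()
sc.show_profiles()  # per the note above, this may still print nothing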

user10938362