PySpark's profiler uses cProfile and works as documented for the RDD API, but there seems to be no way to get it to print results after running a series of DataFrame API operations. Is there one?
from pyspark import SparkContext, SQLContext
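sc = SparkContext() # assumes spark.python.profile=true is already set (e.g. via --conf), otherwise nothing is profiled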
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('a', 0), ('b', 1)])
df = sqlContext.createDataFrame(rdd)
rdd.count() # this ACTUALLY gets profiled :)
sc.show_profiles() # here is where the profiling prints out
sc.show_profiles() # prints nothing here (no new profiling to show)
rdd.count() # this ACTUALLY gets profiled :)
sc.show_profiles() # here is where the profiling prints out
# now the same thing with the DataFrame API
df.count() # why does this NOT get profiled?!?
sc.show_profiles() # prints nothing?!
# and again, it works when converting to an RDD but not on the DataFrame directly
df.rdd.count() # this ACTUALLY gets profiled :)
sc.show_profiles() # here is where the profiling prints out
df.count() # why does this NOT get profiled?!?
sc.show_profiles() # prints nothing?!
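
Since the profiler is built on cProfile, it can only report on Python code that actually runs under it. For context, here is a minimal standalone sketch of that mechanism with no Spark involved. The names `count_rows` and `profile_call` are hypothetical and used only for illustration; this is not how PySpark wires up its profiler internally.

```python
import cProfile
import io
import pstats


def count_rows(rows):
    """Hypothetical stand-in for Python code that a worker would execute."""
    return sum(1 for _ in rows)


def profile_call(func, *args):
    """Run func under cProfile and return (result, formatted stats text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Render the collected stats into a string instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats()
    return result, buffer.getvalue()


result, report = profile_call(count_rows, [("a", 0), ("b", 1)])
print(result)
print(report)
```

The report lists only the Python calls made while the profiler was active. If an operation never executes Python code in a profiled process, a cProfile-based profiler would have nothing to collect, which may be related to the empty output above.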