I have a large PySpark DataFrame and want to compute a histogram of one of its columns.
I can do:
df.select("col").rdd.flatMap(lambda x: x).histogram(100)
This works, but it is very slow. It seems to convert the DataFrame to an RDD, and I am not even sure why I need the flatMap.
What is the best/fastest way to achieve this?
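
On the flatMap part, a minimal plain-Python sketch of what that step appears to do. It assumes each element of df.rdd is a tuple-like Row, so the RDD holds one-field rows rather than bare numbers, and histogram needs the numbers. The namedtuple below is only a stand-in for pyspark.sql.Row, and the sample values are made up for illustration; no Spark is involved.

```python
from collections import namedtuple
from itertools import chain

# Stand-in for pyspark.sql.Row (assumption: like Row, it is tuple-like).
Row = namedtuple("Row", ["col"])

# Roughly what df.select("col").rdd would contain: one-field rows.
rows = [Row(1.0), Row(2.5), Row(4.0)]

# flatMap(lambda x: x) iterates over each row and emits its fields,
# which for one-column rows amounts to unwrapping them into scalars.
values = list(chain.from_iterable(rows))

print(values)  # [1.0, 2.5, 4.0]
```

If that reading is right, .map(lambda r: r[0]) should give the same result for a single column, so the flatMap itself is probably not where the time goes.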