
I have a large pyspark dataframe and want a histogram of one of the columns.

I can do:

df.select("col").rdd.flatMap(lambda x: x).histogram(100)

but this is very slow: it seems to convert the DataFrame to an RDD, and I am not even sure why I need the flatMap.

What is the best/fastest way to achieve this?

Simd
    You only need `flatMap` if your column contains nested values. Refer to this question for other ways: https://stackoverflow.com/questions/36043256/making-histogram-with-spark-dataframe-column – Spandan Brahmbhatt Sep 13 '17 at 16:17
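
One of the "other ways" covered by the linked question is to stay entirely in the DataFrame API so Spark aggregates the bin counts before anything is collected. The following is only a minimal sketch of that idea (not code from the thread): the column name `col`, the 100-bin choice, and the variable names are illustrative, and it assumes a numeric column with more than one distinct value.

    from pyspark.sql import functions as F

    num_bins = 100

    # One pass to get the column's range; assumes the column is numeric
    # and has more than one distinct value (otherwise bin_width is 0).
    col_min, col_max = df.select(F.min("col"), F.max("col")).first()
    bin_width = (col_max - col_min) / num_bins

    # Map each value to a bin index (clamping the maximum into the last bin)
    # and let Spark count per bin, so only ~100 rows ever reach the driver.
    hist = (
        df.select(
            F.least(
                F.floor((F.col("col") - F.lit(col_min)) / F.lit(bin_width)),
                F.lit(num_bins - 1),
            ).alias("bin")
        )
        .groupBy("bin")
        .count()
        .orderBy("bin")
    )
    hist.show()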

1 Answer


Convert your DataFrame to a pandas DataFrame:

df_pd = df.toPandas()

Then use:

%matplotlib inline
import matplotlib.pyplot as plt
df_pd.hist(column='column name')

This should work.

Sravan M
    Conversion to a pandas DataFrame is very inefficient and is also not guaranteed to work in low-memory environments. – ciurlaro Nov 29 '20 at 00:45
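
If you still want the matplotlib route from this answer, one way to sidestep the memory concern raised above is to downsample in Spark before calling `toPandas()`. A minimal sketch, reusing the answer's `'column name'` placeholder; the 0.01 fraction and the seed are arbitrary illustrative choices:

    import matplotlib.pyplot as plt

    # Downsample in Spark first so only a small fraction of rows is
    # collected to the driver.
    df_small = df.select("column name").sample(fraction=0.01, seed=42)
    df_pd = df_small.toPandas()

    # A histogram of the sample only approximates the full distribution;
    # scale the counts by 1/fraction if absolute frequencies matter.
    df_pd.hist(column="column name", bins=100)
    plt.show()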