
I have a large pyspark dataframe and want a histogram of one of the columns.

I can do:

df.select("col").rdd.flatMap(lambda x: x).histogram(100)

but this is very slow: it seems to convert the DataFrame to an RDD, and I am not even sure why I need the flatMap.

What is the best/fastest way to achieve this?

Simd
    You only need `flatMap` if your column contains nested values. Refer to this question for other ways: https://stackoverflow.com/questions/36043256/making-histogram-with-spark-dataframe-column – Spandan Brahmbhatt Sep 13 '17 at 16:17
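
One of the "other ways" covered by the linked question is to stay entirely in the DataFrame API so Spark aggregates the bin counts before anything is collected. The following is only a minimal sketch of that idea (not code from the thread): the column name `col`, the 100-bin choice, and the variable names are illustrative, and it assumes a numeric column with more than one distinct value.

    from pyspark.sql import functions as F

    num_bins = 100

    # One pass to get the column's range; assumes the column is numeric
    # and has more than one distinct value (otherwise bin_width is 0).
    col_min, col_max = df.select(F.min("col"), F.max("col")).first()
    bin_width = (col_max - col_min) / num_bins

    # Map each value to a bin index (clamping the maximum into the last bin)
    # and let Spark count per bin, so only ~100 rows ever reach the driver.
    hist = (
        df.select(
            F.least(
                F.floor((F.col("col") - F.lit(col_min)) / F.lit(bin_width)),
                F.lit(num_bins - 1),
            ).alias("bin")
        )
        .groupBy("bin")
        .count()
        .orderBy("bin")
    )
    hist.show()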

1 Answer


Convert your DataFrame to a pandas DataFrame:

df_pd = df.toPandas()

Then use:

%matplotlib inline
import matplotlib.pyplot as plt
df_pd.hist(column='column name')

This should work.

Sravan M
    Conversion to a pandas DataFrame is very inefficient and is also not guaranteed to work in low-memory environments. – ciurlaro Nov 29 '20 at 00:45
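
If you still want the matplotlib route from this answer, one way to sidestep the memory concern raised above is to downsample in Spark before calling `toPandas()`. A minimal sketch, reusing the answer's `'column name'` placeholder; the 0.01 fraction and the seed are arbitrary illustrative choices:

    import matplotlib.pyplot as plt

    # Downsample in Spark first so only a small fraction of rows is
    # collected to the driver.
    df_small = df.select("column name").sample(fraction=0.01, seed=42)
    df_pd = df_small.toPandas()

    # A histogram of the sample only approximates the full distribution;
    # scale the counts by 1/fraction if absolute frequencies matter.
    df_pd.hist(column="column name", bins=100)
    plt.show()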