
Is there any way to plot data from a Spark dataframe without converting it to pandas?

I did some online research but can't find a way. I need to save these plots as .pdf files automatically, so the built-in visualization tool in Databricks won't work.

Right now, this is what I'm doing (as an example):

import matplotlib.pyplot as plt

# df = some Spark data frame
pdf = df.toPandas()
ax = pdf.plot()
display(ax.get_figure())  # plt.show() returns None, so pass the figure itself

I want to produce line graphs, histograms, bar charts and scatter plots without converting my dataframe to a pandas dataframe. Thank you!

DennisLi
KikiNeko

3 Answers


The display function is only available in Databricks notebooks; it is not part of Spark itself.

Gravity

Just call display(<dataframe-name>) on the Spark dataframe, as described in the official Databricks Visualizations documentation.

Then select the plot type and change its options in the notebook output to show a chart directly from the Spark dataframe.

If you want exactly the same chart as your pandas dataframe plot, your current approach (converting with toPandas() first) is the only way.
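Since the question also asks for automatic PDF output, the pandas route can write the file directly, without the notebook UI. A minimal sketch, assuming pandas and matplotlib are installed on the driver; the sample data, column names and output path are hypothetical, and `pdf` stands in for the result of `df.toPandas()`:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for `df.toPandas()` on a real Spark DataFrame.
pdf = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 15.0, 30.0]})

# DataFrame.plot returns a matplotlib Axes; its figure can be written to PDF.
ax = pdf.plot(x="x", y="y", kind="line")
out_path = os.path.join(tempfile.gettempdir(), "spark_plot.pdf")
ax.get_figure().savefig(out_path)  # format is inferred from the .pdf extension
plt.close(ax.get_figure())
```

For large tables, aggregating or sampling in Spark before calling toPandas() keeps the collected data small enough for the driver.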

Peter Pan

If the Spark dataframe 'df' from the question is of type 'pyspark.pandas.frame.DataFrame', try the following:

# Plot spark dataframe
df.column_name.plot.pie()

where column_name is one of the columns in the Spark dataframe 'df'.

You can check the type of 'df' with

type(df)

There are other plotting functions, such as:

        pyspark.pandas.DataFrame.plot.line
        pyspark.pandas.DataFrame.plot.bar
        pyspark.pandas.DataFrame.plot.scatter

These are documented in the Apache Spark docs: https://spark.apache.org/docs/3.2.1/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.plot.bar.html

If the spark dataframe 'df' is of type 'pyspark.sql.dataframe.DataFrame', then try the following:

# Import pyspark.pandas
import pyspark.pandas as ps

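# Convert pyspark.sql.dataframe.DataFrame to pyspark.pandas.frame.DataFrame.
# index_column is a placeholder for a different column than the one plotted:
# a column moved into the index is no longer reachable as temp_df.column_name.
temp_df = ps.DataFrame(df).set_index('index_column')

# Plot spark dataframe
temp_df.column_name.plot.pie()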

Note: There may be better ways to do this. If so, kindly suggest them in the comments.

sunil karki