How to convert Spark RDD to pandas dataframe in ipython?

Question

I have an RDD and I want to convert it to a pandas DataFrame. I know that to convert an RDD to a normal (Spark) DataFrame we can do

df = rdd1.toDF()

But I want to convert the RDD to a pandas DataFrame, not a Spark DataFrame. How can I do it?

score 47 · Answer 1 · edited Jun 20 '20 at 09:12

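You can use the DataFrame method toPandas() once the RDD has been converted to a Spark DataFrame with toDF(). From the documentation: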

Returns the contents of this DataFrame as Pandas pandas.DataFrame.

This is only available if Pandas is installed and available.

>>> df.toPandas()  
   age   name
0    2  Alice
1    5    Bob
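
Putting the two calls together for the RDD from the question, a minimal sketch might look like the following. The helper name rdd_to_pandas and the column names are hypothetical, chosen only for illustration; toDF() and toPandas() are the calls named in this thread.

```python
def rdd_to_pandas(rdd, columns=None):
    """Sketch: convert a PySpark RDD of tuples/Rows to a pandas DataFrame.

    `rdd_to_pandas` is a hypothetical helper, not part of PySpark.
    """
    # Step 1: RDD -> Spark DataFrame. Column names are optional; without
    # them Spark falls back to generated names such as _1, _2, ...
    spark_df = rdd.toDF(columns) if columns else rdd.toDF()
    # Step 2: Spark DataFrame -> pandas DataFrame. This collects all rows
    # to the driver, so the data has to fit in driver memory.
    return spark_df.toPandas()


# Usage, assuming an active SparkSession named `spark` (an assumption,
# it is not shown in the question):
#   rdd1 = spark.sparkContext.parallelize([(2, "Alice"), (5, "Bob")])
#   pdf = rdd_to_pandas(rdd1, ["age", "name"])
```

Because toPandas() pulls everything onto the driver, this is best suited to results that are already small or have been filtered down.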

answered Jan 15 '16 at 19:10 by jezrael · edited Jun 20 '20 at 09:12 by Community

@jezrael, how to convert only first 10 rows of `spark df to pandas df`? – Pyd Apr 29 '19 at 07:18
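
On the comment about converting only the first 10 rows: one possible approach is to cut the Spark DataFrame down with limit() before calling toPandas(), so that only those rows are collected. The helper name first_rows_to_pandas is hypothetical; this is a sketch, not the answerer's own reply.

```python
def first_rows_to_pandas(spark_df, n=10):
    """Sketch: convert only the first `n` rows of a Spark DataFrame to pandas.

    `first_rows_to_pandas` is a hypothetical helper, not part of PySpark.
    """
    # limit(n) returns a new Spark DataFrame with at most n rows, so
    # toPandas() only collects those rows to the driver.
    return spark_df.limit(n).toPandas()


# Usage, assuming `df` is an existing Spark DataFrame:
#   pdf = first_rows_to_pandas(df, 10)
```

Without an explicit orderBy() on the Spark DataFrame, which rows count as the "first" ones is not guaranteed to be stable.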

score 17 · Answer 2 · edited Nov 30 '17 at 09:01

You'll have to use a Spark DataFrame as an intermediary step between your RDD and the desired Pandas DataFrame.

For example, let's say I have a text file, flights.csv, that has been read into an RDD:

flights = sc.textFile('flights.csv')

You can check the type:

type(flights)
<class 'pyspark.rdd.RDD'>

If you call toPandas() directly on the RDD, it fails, because RDDs have no toPandas() method. Depending on the format of the objects in your RDD, some processing may be necessary to get to a Spark DataFrame first. In this example, the following code does the job:

# RDD to Spark DataFrame
sparkDF = flights.map(lambda x: str(x)).map(lambda w: w.split(',')).toDF()

# Spark DataFrame to Pandas DataFrame
pdsDF = sparkDF.toPandas()

You can check the type:

type(pdsDF)
<class 'pandas.core.frame.DataFrame'>
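
With the code above, every line of the file becomes a row (including a header line, if there is one) and the columns get generated names. As a variation, here is a sketch that assumes the first line of the file is a header and uses it for the column names. That header assumption and the helper name csv_rdd_to_pandas are mine, not part of the original example.

```python
def csv_rdd_to_pandas(lines_rdd, sep=","):
    """Sketch: turn an RDD of CSV text lines into a pandas DataFrame.

    `csv_rdd_to_pandas` is a hypothetical helper. It assumes the first
    line is a header and that no field contains the separator.
    """
    # Take the header line and derive the column names from it.
    header = lines_rdd.first()
    columns = header.split(sep)
    # Drop the header (and any identical lines), then split each data line.
    rows = (lines_rdd
            .filter(lambda line: line != header)
            .map(lambda line: line.split(sep)))
    # RDD -> Spark DataFrame -> pandas DataFrame; all values stay strings.
    return rows.toDF(columns).toPandas()


# Usage, assuming `sc` is the SparkContext from the example above:
#   flights = sc.textFile('flights.csv')
#   pdsDF = csv_rdd_to_pandas(flights)
```

A plain split(',') does not handle quoted fields that contain commas, so this only suits simple files.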

answered Jan 16 '16 at 05:06 by RKD314 · edited Nov 30 '17 at 09:01 by chhantyal

I think `pdsDF = sparkDF.toPandas` is missing the () to actually call the method. It should be: `pdsDF = sparkDF.toPandas()` – learn2day Jun 26 '17 at 21:17
What is the difference between toDF() and toPandas()? – jtlz2 Mar 19 '19 at 10:21
toDF() converts an RDD to a Spark DataFrame, and toPandas() converts a Spark DataFrame to a Pandas DataFrame. The two kinds of DataFrames are different types. – RKD314 Jun 27 '19 at 18:57
