
I know that you can convert a Spark dataframe df into a pandas dataframe with

df.toPandas()

However, this is taking very long, so I found out about the Koalas package in Databricks, which would let me work with the data as if it were a pandas dataframe (for instance, being able to use scikit-learn) without actually converting it to one. I already have the Spark dataframe, but I cannot find a way to turn it into a Koalas one.


2 Answers


To go straight from a pyspark dataframe (I am assuming that is what you are working with) to a Koalas dataframe, you can use:

import databricks.koalas as ks

koalas_df = ks.DataFrame(your_pyspark_df)

Here I've imported Koalas as ks.
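
As a side note, importing Koalas also patches pyspark dataframes with a to_koalas() shortcut, so this sketch (with your_pyspark_df standing in for your own dataframe) should be equivalent:

import databricks.koalas as ks

# importing koalas adds a .to_koalas() method to pyspark DataFrames
koalas_df = your_pyspark_df.to_koalas()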

Well, first of all, you have to understand why toPandas() takes so long:

  • Spark dataframes are distributed across different nodes, and when you run toPandas(), Spark pulls the whole distributed dataframe back to the driver node (that's why it takes a long time).

  • You are then able to use pandas or scikit-learn on the single (driver) node for faster analysis and modeling, because it's like modeling on your own PC.

  • Koalas is the pandas API on Spark. When you convert to a Koalas dataframe, the data stays distributed, so nothing is pulled back to the driver or shuffled between nodes, and you can use pandas-like syntax for distributed dataframe transformations (see the sketch after this list).
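
A minimal sketch of the contrast, assuming a pyspark dataframe named spark_df with a numeric column named value (both names are placeholders):

import databricks.koalas as ks

# collects every partition to the driver: slow and memory-hungry for big data
pdf = spark_df.toPandas()

# stays distributed: pandas-style calls are translated into Spark operations
kdf = ks.DataFrame(spark_df)
filtered = kdf[kdf["value"] > 0]   # runs on the cluster, not on the driver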
    What happens under the hood when you convert a Koalas dataframe to a Spark dataframe using koalas_df.to_spark()? – Eric Bellet Oct 25 '19 at 17:47
  • 3
    It's just mimicking the pandas style so that people who are familiar with Python pandas can more easily get hands-on with Spark dataframe processing. Under the hood, these are still transformations of Spark dataframes, which are immutable... – seninus Oct 28 '19 at 01:28
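
To make that round trip concrete, a rough sketch (variable names are placeholders):

import databricks.koalas as ks

kdf = ks.DataFrame(spark_df)   # wrap an existing Spark dataframe
sdf = kdf.to_spark()           # unwrap it back to a plain Spark dataframe

# Neither direction collects data to the driver: Koalas keeps the data as an
# immutable, distributed Spark dataframe and only maintains extra metadata
# (such as the index) on top of it.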