
I know that you can convert a Spark dataframe df into a pandas dataframe with

df.toPandas()

However, this is taking very long, so I found out about the Koalas package in Databricks, which would let me work with the data as if it were a pandas dataframe (for instance, being able to use scikit-learn) without actually converting it to one. I already have the Spark dataframe, but I cannot find a way to turn it into a Koalas one.


2 Answers


To go straight from a pyspark dataframe (I am assuming that is what you are working with) to a Koalas dataframe, you can use:

import databricks.koalas as ks

koalas_df = ks.DataFrame(your_pyspark_df)

Here I've imported Koalas as ks.
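
As a side note, importing Koalas also patches pyspark dataframes with a to_koalas() shortcut, so this sketch (with your_pyspark_df standing in for your own dataframe) should be equivalent:

import databricks.koalas as ks

# importing koalas adds a .to_koalas() method to pyspark DataFrames
koalas_df = your_pyspark_df.to_koalas()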

Well, first of all, you have to understand why toPandas() takes so long:

  • Spark dataframes are distributed across different nodes, and when you run toPandas(), Spark pulls the whole distributed dataframe back to the driver node (that's why it takes a long time).

  • You are then able to use pandas or scikit-learn on the single (driver) node for faster analysis and modeling, because it's like modeling on your own PC.

  • Koalas is the pandas API on Spark. When you convert to a Koalas dataframe, the data stays distributed, so nothing is pulled back to the driver or shuffled between nodes, and you can use pandas-like syntax for distributed dataframe transformations (see the sketch after this list).
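
A minimal sketch of the contrast, assuming a pyspark dataframe named spark_df with a numeric column named value (both names are placeholders):

import databricks.koalas as ks

# collects every partition to the driver: slow and memory-hungry for big data
pdf = spark_df.toPandas()

# stays distributed: pandas-style calls are translated into Spark operations
kdf = ks.DataFrame(spark_df)
filtered = kdf[kdf["value"] > 0]   # runs on the cluster, not on the driver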
    What happens under the hood when you convert a Koalas dataframe to a Spark dataframe using koalas_df.to_spark()? – Eric Bellet Oct 25 '19 at 17:47
  • 3
    It's just mimicking the pandas style so that people who are familiar with Python pandas can more easily get hands-on with Spark dataframe processing. Under the hood, these are still transformations of Spark dataframes, which are immutable... – seninus Oct 28 '19 at 01:28
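
To make that round trip concrete, a rough sketch (variable names are placeholders):

import databricks.koalas as ks

kdf = ks.DataFrame(spark_df)   # wrap an existing Spark dataframe
sdf = kdf.to_spark()           # unwrap it back to a plain Spark dataframe

# Neither direction collects data to the driver: Koalas keeps the data as an
# immutable, distributed Spark dataframe and only maintains extra metadata
# (such as the index) on top of it.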