Context
PySpark uses Arrow to convert to pandas, and Polars is an abstraction over Arrow memory. So we can hijack the API that Spark uses internally to create the Arrow data and use it to build the Polars DataFrame directly.
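To make the second claim concrete: pl.from_arrow can wrap an existing pyarrow Table as a Polars DataFrame. The tiny table below is only an illustration; for Arrow-compatible types the conversion is typically (near) zero-copy.

import pyarrow as pa
import polars as pl

# A plain Arrow table handed straight to Polars; for Arrow-native types
# this is typically a cheap, (near) zero-copy conversion
tbl = pa.table({"name": ["James"], "properties": [[1, 2]]})
df = pl.from_arrow(tbl)
print(df.schema)  # e.g. {'name': String, 'properties': List(Int64)}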
TLDR
Given a Spark session spark, we can write:
import pyarrow as pa
import polars as pl
from pyspark.sql import SQLContext

sql_context = SQLContext(spark)
data = [("James", [1, 2])]
spark_df = sql_context.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow() returns a list of Arrow RecordBatches; assemble them
# into an Arrow Table and let Polars wrap that memory
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘
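If you need this in more than one place, the one-liner can be wrapped in a small helper. spark_to_polars is a hypothetical name (it is not part of Spark or Polars), and it relies on the private _collect_as_arrow method, so it may change between Spark versions:

import pyarrow as pa
import polars as pl

def spark_to_polars(spark_df) -> pl.DataFrame:
    # Collect the result on the driver as Arrow RecordBatches, then let
    # Polars build a DataFrame from the combined Arrow Table.
    # Note: an empty result yields an empty batch list, and
    # pa.Table.from_batches([]) needs an explicit schema in that case.
    batches = spark_df._collect_as_arrow()
    return pl.from_arrow(pa.Table.from_batches(batches))

df = spark_to_polars(spark_df)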
Serialization steps
This will actually be faster than the toPandas provided by Spark itself, because it saves an extra copy. toPandas() will lead to this serialization/copy chain:

spark-memory -> arrow-memory -> pandas-memory
With the snippet above we only have:
spark-memory -> arrow/polars-memory
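To check the difference on your own data, one rough way is to time the two paths side by side (assuming spark is an active SparkSession; a single un-warmed run with time.perf_counter, a sketch rather than a benchmark):

import time
import pyarrow as pa
import polars as pl

# Hypothetical larger DataFrame just for the comparison; use your own data
big_df = spark.range(10_000_000).toDF("id")

t0 = time.perf_counter()
pandas_df = big_df.toPandas()  # spark-memory -> arrow-memory -> pandas-memory
t1 = time.perf_counter()
polars_df = pl.from_arrow(pa.Table.from_batches(big_df._collect_as_arrow()))  # spark-memory -> arrow/polars-memory
t2 = time.perf_counter()

print(f"toPandas():      {t1 - t0:.2f} s")
print(f"Arrow -> Polars: {t2 - t1:.2f} s")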