Context
PySpark uses Arrow to convert to pandas, and Polars is an abstraction over Arrow memory. So we can hijack the API that Spark uses internally to create the Arrow data and use it to build the Polars DataFrame directly.
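To make the second claim concrete: pl.from_arrow can wrap an existing pyarrow Table as a Polars DataFrame. The tiny table below is only an illustration; for Arrow-compatible types the conversion is typically (near) zero-copy.

import pyarrow as pa
import polars as pl

# A plain Arrow table handed straight to Polars; for Arrow-native types
# this is typically a cheap, (near) zero-copy conversion
tbl = pa.table({"name": ["James"], "properties": [[1, 2]]})
df = pl.from_arrow(tbl)
print(df.schema)  # e.g. {'name': String, 'properties': List(Int64)}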
TLDR
Given a Spark session spark, we can write:
import pyarrow as pa
import polars as pl
from pyspark.sql import SQLContext

sql_context = SQLContext(spark)
data = [("James", [1, 2])]
spark_df = sql_context.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow() returns a list of Arrow RecordBatches; assemble them
# into an Arrow Table and let Polars wrap that memory
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘
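If you need this in more than one place, the one-liner can be wrapped in a small helper. spark_to_polars is a hypothetical name (it is not part of Spark or Polars), and it relies on the private _collect_as_arrow method, so it may change between Spark versions:

import pyarrow as pa
import polars as pl

def spark_to_polars(spark_df) -> pl.DataFrame:
    # Collect the result on the driver as Arrow RecordBatches, then let
    # Polars build a DataFrame from the combined Arrow Table.
    # Note: an empty result yields an empty batch list, and
    # pa.Table.from_batches([]) needs an explicit schema in that case.
    batches = spark_df._collect_as_arrow()
    return pl.from_arrow(pa.Table.from_batches(batches))

df = spark_to_polars(spark_df)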
Serialization steps
This will actually be faster than the toPandas provided by Spark itself, because it saves an extra copy. toPandas() will lead to this serialization/copy chain:

spark-memory -> arrow-memory -> pandas-memory
With the snippet above we only have:
spark-memory -> arrow/polars-memory
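To check the difference on your own data, one rough way is to time the two paths side by side (assuming spark is an active SparkSession; a single un-warmed run with time.perf_counter, a sketch rather than a benchmark):

import time
import pyarrow as pa
import polars as pl

# Hypothetical larger DataFrame just for the comparison; use your own data
big_df = spark.range(10_000_000).toDF("id")

t0 = time.perf_counter()
pandas_df = big_df.toPandas()  # spark-memory -> arrow-memory -> pandas-memory
t1 = time.perf_counter()
polars_df = pl.from_arrow(pa.Table.from_batches(big_df._collect_as_arrow()))  # spark-memory -> arrow/polars-memory
t2 = time.perf_counter()

print(f"toPandas():      {t1 - t0:.2f} s")
print(f"Arrow -> Polars: {t2 - t1:.2f} s")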