Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third.
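For reference, a minimal sketch of that cogrouped API; the DataFrames, column names, and schema here are illustrative, not from my actual job:

import pandas as pd

def merge_groups(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Both inputs arrive as Pandas DataFrames holding the rows for one co-group key
    return pd.merge(left, right, on="id")

result = (
    df1.groupBy("id")
    .cogroup(df2.groupBy("id"))
    .applyInPandas(merge_groups, schema="id long, v1 double, v2 double")
)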

For efficient translation between Spark DataFrames and Pandas DataFrames, Spark uses the Apache Arrow memory layout; however, a transformation is still required to go from Arrow to Pandas and back. I would really like to access the Arrow data directly, as this is how I will ultimately be working with the data in the UDF (using Polars).

It seems wasteful to go from Spark -> Arrow -> Pandas -> Arrow (Polars) on the way in and the reverse on the return.
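Concretely, the best I can do inside the UDF today looks something like the following sketch, which pays for the Pandas hop in both directions (the join itself is illustrative):

import pandas as pd
import polars as pl

def my_udf(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # Spark has already gone Arrow -> Pandas before calling this function;
    # from_pandas converts back into Polars' Arrow-backed memory
    out = pl.from_pandas(left).join(pl.from_pandas(right), on="id")
    # to_pandas converts once more, and Spark then re-encodes to Arrow
    return out.to_pandas()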

Plug1
    That's an interesting question. All we would need is to be able to go `spark -> arrow -> spark`, as Polars has mostly zero-copy interop with Arrow. – ritchie46 Mar 25 '22 at 09:04

1 Answer

import pyarrow as pa
import polars as pl
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [("James", [1, 2])]
spark_df = spark.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow() is a private PySpark method that returns the query
# result as a list of pyarrow.RecordBatch, bypassing the Pandas conversion
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))

print(df)

shape: (1, 2)
┌───────┬────────────┐
│ name  ┆ properties │
│ ---   ┆ ---        │
│ str   ┆ list[i64]  │
╞═══════╪════════════╡
│ James ┆ [1, 2]     │
└───────┴────────────┘
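One caveat worth adding (not part of the original answer): this covers the Spark -> Polars direction. For the return trip, most Spark versions still have to route through Pandas, as in the sketch below; newer Spark releases reportedly accept a pyarrow.Table in createDataFrame directly, which would close the loop.

# Reverse direction: Polars -> Pandas -> Spark (the Pandas hop returns here)
spark_df2 = spark.createDataFrame(df.to_pandas())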

ritchie46