Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third.
For efficient translation between Spark DataFrames and Pandas DataFrames, Spark uses the Apache Arrow memory layout, however transformation is still required to go from Arrow to Pandas and back. I would really like to access the Arrow data directly, as this is how I will ultimately be working with the data in the UDF (using Polars).
It seems wasteful to go from Spark -> Arrow -> Pandas -> Arrow (Polars) on the way in and the reverse on the return.