
How to correctly transform a Polars DataFrame to a PySpark DataFrame?

More specifically, the conversion methods I've tried all seem to have problems parsing columns containing arrays/lists.

Create a Spark DataFrame:

data = [{"id": 1, "strings": ['A', 'C'], "floats": [0.12, 0.43]},
        {"id": 2, "strings": ['B', 'B'], "floats": [0.01]},
        {"id": 3, "strings": ['C'], "floats": [0.09, 0.01]}
        ]

sparkdf = spark.createDataFrame(data)

Convert it to Polars:

import pyarrow as pa
import polars as pl
pldf = pl.from_arrow(pa.Table.from_batches(sparkdf._collect_as_arrow()))
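
As an aside, a hedged alternative for this step that avoids the private _collect_as_arrow() API (my sketch, not from the original post; array columns may need extra care on this path):

# Sketch: Spark -> pandas -> Polars; enabling Arrow speeds up toPandas().
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pldf = pl.from_pandas(sparkdf.toPandas())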

Try to convert back to a Spark DataFrame (attempt 1):

spark.createDataFrame(pldf.to_pandas())


TypeError: Can not infer schema for type: <class 'numpy.ndarray'>
TypeError: Unable to infer the type of the field floats.

Try to convert back to a Spark DataFrame (attempt 2):

schema = sparkdf.schema
spark.createDataFrame(pldf.to_pandas(), schema)

TypeError: field floats: ArrayType(DoubleType(), True) can not accept object array([0.12, 0.43]) in type <class 'numpy.ndarray'>
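
Both failures have the same root cause: to_pandas() leaves every array cell as a numpy.ndarray, which Spark's schema inference and ArrayType verifier reject. A minimal workaround sketch (mine, not from the original question) is to convert those cells to plain Python lists first:

pdf = pldf.to_pandas()

# Turn each numpy.ndarray cell into a plain Python list,
# which Spark's ArrayType columns accept.
for col in ("strings", "floats"):
    pdf[col] = pdf[col].map(list)

spark.createDataFrame(pdf, schema=sparkdf.schema)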

Related: How to transform Spark dataframe to Polars dataframe?

  • I had a similar problem. I ended up saving the dataframe as Parquet and then loading it back with Spark. @Dean MacGregor's answer works well for smaller datasets, but for big datasets converting to dicts takes a long time. I am hoping Spark will allow creating a dataframe from an Arrow table. – Luca Feb 14 '23 at 10:46
  • Unfortunately the reason I want to convert it to Spark in the first place is so that I can save it :). I'm in Azure Databricks and need to save to Azure Storage gen2. Spark handles this perfectly, but Polars does not. – ihopethiswillfi Feb 15 '23 at 17:22
  • 1
  • I also use Azure Databricks and I am able to save files with Polars. However, there is a difference in the path: with Polars it is `df.collect().write_parquet(f'/dbfs/mnt/..')`, while with Spark I start with /mnt and skip the /dbfs/. – Luca Feb 15 '23 at 17:34
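
For illustration, a sketch of the path difference described in the comment above; the mount and file names are hypothetical:

# Polars writes through the local /dbfs FUSE mount:
pldf.write_parquet("/dbfs/mnt/mymount/out.parquet")

# Spark addresses the same mounted storage without the /dbfs prefix:
sparkdf.write.parquet("/mnt/mymount/out.parquet")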

3 Answers


I discovered the right way to do this while reading 'In-Memory Analytics with Apache Arrow' by Matthew Topol.

You can do:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df_spark = spark.createDataFrame(df_polars.to_pandas())

It's quite fast.

I also tried converting to an Arrow-backed pandas DataFrame first (i.e. df_polars.to_pandas(use_pyarrow_extension_array=True)), but it does not work: Spark complains that it does not know how to handle column types such as large strings (Polars' Utf8) or unsigned integers.
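
If you do want the Arrow-backed route, an untested sketch is to cast the problematic Polars dtypes to Spark-friendly ones before converting; which casts you need depends on your schema, so treat these as assumptions:

# Untested sketch: downcast unsigned ints before to_pandas(); note that
# UInt64 values above the Int64 range would not survive this cast.
df_polars = df_polars.with_columns(
    pl.col(pl.UInt32).cast(pl.Int64),
    pl.col(pl.UInt64).cast(pl.Int64),
)

The large-string complaint appears to come from Arrow's large_utf8 representation of Polars strings, which is why the plain to_pandas() call above is the simpler route.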

Not setting spark.sql.execution.arrow.pyspark.enabled to true increased the time 90-fold in my test (from 1.5 seconds to 2 minutes 18 seconds).

Luca

What about

spark.createDataFrame(pldf.to_dicts())

Alternatively you could do:

spark.createDataFrame({x:y.to_list() for x,y in pldf.to_dict().items()})

Since the to_dict method returns Polars Series instead of lists, I'm using a comprehension to convert the Series into regular lists that Spark understands.
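
If Spark's schema inference over the row dicts still fails (as a comment below reports), a hedged variant is to supply the schema explicitly; here I reuse sparkdf.schema from the question:

# Sketch: an explicit schema sidesteps type inference on the row dicts.
df_back = spark.createDataFrame(pldf.to_dicts(), schema=sparkdf.schema)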

Dean MacGregor
  • I get: `raise TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can not infer schema for type: ` – rnd om Apr 20 '23 at 19:53

DataFrame.transform(func: Callable[..., DataFrame], *args: Any, **kwargs: Any) → pyspark.sql.dataframe.DataFrame: returns a new DataFrame. Concise syntax for chaining custom transformations.
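
For context, a minimal usage sketch of transform; the function and column names are illustrative, not part of the documentation quoted above:

from pyspark.sql import functions as F

def with_doubled_id(df):
    # A custom transformation: derive a new column from `id`.
    return df.withColumn("id_doubled", F.col("id") * 2)

sparkdf.transform(with_doubled_id).show()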

  • Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Community Dec 10 '22 at 22:05