Let's assume we have the dataframe given below. For each row, I need to create a dictionary and pass it to a UDF for some processing logic. Is there a way to achieve this using either a Polars or a PySpark dataframe?
Can you show a reproducible example with expected output, please? – ignoring_gravity Mar 06 '23 at 10:44
2 Answers
4
With Polars, you can use:
# Dict of lists
>>> df.transpose().to_dict(as_series=False)
{'column_0': [1.0, 100.0, 1000.0], 'column_1': [2.0, 200.0, None]}
# List of dicts
>>> df.to_dicts()
[{'Account number': 1, 'V1': 100, 'V2': 1000.0},
{'Account number': 2, 'V1': 200, 'V2': None}]
Input dataframe:
>>> df
shape: (2, 3)
┌────────────────┬─────┬────────┐
│ Account number ┆ V1 ┆ V2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞════════════════╪═════╪════════╡
│ 1 ┆ 100 ┆ 1000.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 200 ┆ null │
└────────────────┴─────┴────────┘
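Once you have the list of dicts, passing each row to your function is plain Python. Here is a minimal sketch, assuming the rows look like the `df.to_dicts()` output above; `process_row` is a hypothetical stand-in for your UDF, not something from the question. In recent Polars versions, `df.iter_rows(named=True)` should also yield one dict per row without building the whole list first.

```python
# Rows as returned by df.to_dicts() for the input dataframe above
rows = [
    {"Account number": 1, "V1": 100, "V2": 1000.0},
    {"Account number": 2, "V1": 200, "V2": None},
]


def process_row(row: dict) -> float:
    """Hypothetical UDF: sum V1 and V2, treating null (None) as 0."""
    return (row["V1"] or 0) + (row["V2"] or 0)


# Call the function once per row dictionary
results = [process_row(row) for row in rows]
print(results)
```

This runs the function in a Python loop, so it is likely to be slow on large dataframes; it is mainly useful when the logic cannot be expressed with native expressions.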

Corralien
0
In addition to the response by @Corralien, here is what you can do if you want to call your UDF directly from Polars:
import polars as pl

df = pl.DataFrame({
    'Account number': [1, 2],
    'V1': [100, 200],
    'V2': [1000, None]
})
shape: (2, 3)
┌────────────────┬─────┬──────┐
│ Account number ┆ V1 ┆ V2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞════════════════╪═════╪══════╡
│ 1 ┆ 100 ┆ 1000 │
│ 2 ┆ 200 ┆ null │
└────────────────┴─────┴──────┘
# Define a UDF
def my_udf(row):
    # row is a dict of the struct fields; treat null (None) as 0
    return (row['V1'] or 0) + (row['V2'] or 0)
# Add a column using the UDF
df.with_columns(
    result_udf = pl.struct(pl.all()).apply(my_udf)
)
shape: (2, 4)
┌────────────────┬─────┬──────┬────────────┐
│ Account number ┆ V1 ┆ V2 ┆ result_udf │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════════╪═════╪══════╪════════════╡
│ 1 ┆ 100 ┆ 1000 ┆ 1100 │
│ 2 ┆ 200 ┆ null ┆ 200 │
└────────────────┴─────┴──────┴────────────┘
# You can also run your UDF on multiple threads
df.with_columns(
    result_udf = pl.struct(pl.all())
    .apply(my_udf, strategy='threading', return_dtype=pl.Int64)
)
Regarding running a function on separate threads, here is what the Polars API documentation says:
This functionality is in alpha stage. This may be removed/changed without it being considered a breaking change.
‘threading’: run the python function on separate threads. Use with care as this can slow performance. This might only speed up your code if the amount of work per element is significant and the python function releases the GIL (e.g. via calling a c function).
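To make the caveat concrete, here is a rough standard-library sketch of the same idea outside Polars: running a per-row function on separate threads. This is only an illustration with `concurrent.futures`, not how Polars implements the `threading` strategy, and the row data is copied from the example above. For a function this cheap, the threads would most likely add overhead rather than speed anything up.

```python
from concurrent.futures import ThreadPoolExecutor

# Row dictionaries matching the example dataframe
rows = [
    {"Account number": 1, "V1": 100, "V2": 1000},
    {"Account number": 2, "V1": 200, "V2": None},
]


def my_udf(row):
    # Treat null (None) as 0 before summing
    return (row["V1"] or 0) + (row["V2"] or 0)


# pool.map returns results in the same order as the input rows
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(my_udf, rows))

print(results)
```

Threads pay off mainly when the function spends its time in code that releases the GIL, such as I/O or calls into C extensions.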

Luca