The answer depends on whether you can rewrite your function using Polars expressions.
Using Polars Expressions
To obtain the best performance with Polars, try to code your calculations using Expressions. Expressions yield the most performant, embarrassingly parallel solutions.
For example, your function could be expressed as:
shift_len = 3
df.with_columns(
[
(pl.col("a") + (pl.col("b") * shift_len)).alias("d"),
(pl.col("b") - shift_len).alias("e"),
]
)
shape: (5, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 7 ┆ -1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 11 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 15 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 5 ┆ 6 ┆ 19 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 6 ┆ 7 ┆ 23 ┆ 3 │
└─────┴─────┴─────┴─────┴─────┘
Polars will run both expressions in parallel, yielding very fast results.
Using apply
Let's assume that you cannot code your function as Polars Expressions (e.g., you need to use an external library). Since your function takes multiple parameters and returns multiple values, we'll take this in steps.
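The fun function itself is not shown in this answer, so here is a minimal sketch of what it might look like. The body is an assumption, inferred from the expressions and output above; the real function could do anything that cannot be written as an expression.

```python
# Hypothetical stand-in for the fun function discussed in this answer.
# Assumption: it mirrors the expressions above (a + b * shift_len, b - shift_len).
def fun(a, b, shift_len):
    """Take three scalars and return a tuple of two values."""
    return a + b * shift_len, b - shift_len


# Column values taken from the example tables.
a_values = [1, 2, 3, 4, 5]
b_values = [2, 3, 4, 5, 6]

# One call per row: multiple inputs in, multiple outputs back.
results = [fun(a, b, 3) for a, b in zip(a_values, b_values)]
print(results)
```

The two difficulties handled in the steps below are visible here: each call needs values from two columns plus a constant, and each call hands back a tuple rather than a single value.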
Passing multiple values
We can pass multiple values to the fun function in apply by "stamp-coupling" multiple columns into a single series using polars.struct. In the lambda function, the values are passed as a Python dict, with the names of the columns as the keys. So, for example, we access the value in column a in the lambda below as cols["a"].
df.with_column(
pl.struct(["a", "b"])
.apply(lambda cols: fun(cols["a"], cols["b"], 3))
.alias("result")
)
shape: (5, 4)
┌─────┬─────┬─────┬─────────┐
│ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ object │
╞═════╪═════╪═════╪═════════╡
│ 1 ┆ 2 ┆ 3 ┆ (7, -1) │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ (11, 0) │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ (15, 1) │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 5 ┆ 6 ┆ (19, 2) │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 6 ┆ 7 ┆ (23, 3) │
└─────┴─────┴─────┴─────────┘
The result column contains the tuples returned by the fun function. However, note the type of the result column: object. Columns of type object are not very useful in Polars, and have limited functionality.
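To make the dict-passing step concrete, here is a plain-Python sketch of what the lambda sees for each row. It does not use Polars: the rows list stands in for the struct column, and fun is a hypothetical stand-in whose body is assumed from the output above.

```python
# Hypothetical stand-in for fun; the body is assumed from the output tables.
def fun(a, b, shift_len):
    return a + b * shift_len, b - shift_len


# Each struct value reaches the lambda as a dict keyed by column name.
rows = [{"a": 1, "b": 2}, {"a": 2, "b": 3}, {"a": 3, "b": 4}]

# Same lambda as in the apply call: unpack the dict and add the constant 3.
call_fun = lambda cols: fun(cols["a"], cols["b"], 3)

tuples = [call_fun(cols) for cols in rows]
print(tuples)  # [(7, -1), (11, 0), (15, 1)]
```

Each result is a plain Python tuple, which is what ends up stored in the object column shown above.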
Handling multiple return values
Next we'll convert the tuple returned by the fun function to something more useful: a dictionary of key-value pairs, where the keys are the desired column names (d and e in your example). We'll accomplish this by using Python's zip function and a tuple with the desired names.
When we run this code, we will get a column of type struct.
df.with_column(
pl.struct(["a", "b"])
.apply(lambda cols: dict(zip(("d", "e"), fun(cols["a"], cols["b"], 3))))
.alias("result")
)
shape: (5, 4)
┌─────┬─────┬─────┬───────────┐
│ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ struct[2] │
╞═════╪═════╪═════╪═══════════╡
│ 1 ┆ 2 ┆ 3 ┆ {7,-1} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ {11,0} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ {15,1} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 5 ┆ 6 ┆ {19,2} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 6 ┆ 7 ┆ {23,3} │
└─────┴─────┴─────┴───────────┘
The names d and e do not appear in the printed output of the result column, but they are there, stored as the field names of the struct.
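The conversion itself is ordinary Python, so it can be checked outside Polars. A minimal sketch for a single row, again assuming a hypothetical fun whose body is inferred from the output tables:

```python
# Hypothetical stand-in for fun; the body is assumed from the output tables.
def fun(a, b, shift_len):
    return a + b * shift_len, b - shift_len


names = ("d", "e")       # desired column names, in the order fun returns them
row = {"a": 1, "b": 2}   # what the lambda receives for the first row

# zip pairs each name with the matching tuple element; dict keeps the pairs.
as_dict = dict(zip(names, fun(row["a"], row["b"], 3)))
print(as_dict)  # {'d': 7, 'e': -1}
```

The order of the names tuple has to match the order of the values fun returns, since zip pairs them by position.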
Using unnest
In the last step, we'll use the unnest function to break the struct into two new columns.
df.with_column(
pl.struct(["a", "b"])
.apply(lambda cols: dict(zip(("d", "e"), fun(cols["a"], cols["b"], 3))))
.alias("result")
).unnest("result")
shape: (5, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 7 ┆ -1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 11 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 15 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 5 ┆ 6 ┆ 19 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 6 ┆ 7 ┆ 23 ┆ 3 │
└─────┴─────┴─────┴─────┴─────┘
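The effect of unnesting can be pictured in plain Python as replacing the nested dict with its own key-value pairs. The sketch below is only an analogy: unnest_row is a hypothetical helper written for illustration, not the Polars method, and fun is an assumed stand-in.

```python
# Hypothetical stand-in for fun; the body is assumed from the output tables.
def fun(a, b, shift_len):
    return a + b * shift_len, b - shift_len


rows = [{"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 3, "c": 4}]

# Step 1: add a "result" entry holding the named outputs (the struct).
with_struct = [
    {**row, "result": dict(zip(("d", "e"), fun(row["a"], row["b"], 3)))}
    for row in rows
]


# Step 2: hypothetical helper that mimics unnesting for one row.
def unnest_row(row, name):
    """Replace the dict stored under name with its own key-value pairs."""
    flat = {key: value for key, value in row.items() if key != name}
    flat.update(row[name])
    return flat


unnested = [unnest_row(row, "result") for row in with_struct]
print(unnested[0])  # {'a': 1, 'b': 2, 'c': 3, 'd': 7, 'e': -1}
```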
One caution: using apply with external libraries and/or custom Python bytecode subjects your code to the Python GIL. The result is very slow, single-threaded performance, no matter how it is coded. As such, I strongly suggest avoiding apply and custom Python functions where you can, and instead coding your algorithms using only Polars Expressions.