I want to apply a custom function that takes two columns and outputs a value based on them, row by row.

In Pandas there is a syntax for applying a function based on values in multiple columns:

df['col_3'] = df.apply(lambda x: func(x.col_1, x.col_2), axis=1)

What is the syntax for this in Polars?

Maiia Bocharova

1 Answer

In Polars, you don't add a column by assigning to it in place. You always reassign the whole DataFrame (in other words, there's never a ['col_3'] on the left side of the =).

To that end, if you want your original df with a new column, use the with_columns method.

If you combine that with the answer that was cited by @Nick ODell, specifically this one

you would do

# Note: .apply was renamed to .map_elements in Polars 0.19 and later
df = df.with_columns(pl.struct(['col_1', 'col_2'])
       .apply(lambda x: func(x['col_1'], x['col_2'])).alias('col_3'))

pl.struct converts each row into a struct (basically a dict) containing the columns you name. You can then call apply on that struct column and pass it your function, referencing each field as though the row were a dict (because it is). Finally, alias gives the result the name you want it to have.

All that being said, unless your function is very esoteric, you can, and should, just use the built-in Polars expressions to do whatever the function does. That is much faster because the computation runs in compiled code rather than in the Python interpreter. Expressions can also go through the internal query optimizer and, in some cases, run in parallel on multiple processors.

Dean MacGregor
  • And in many cases you don't need the .apply(lambda x:) part because you can use the Polars expressions to implement that logic. Expressions are typically faster than a .apply because they run in Rust rather than Python – braaannigan Nov 15 '22 at 15:53
  • This is great but doesn't work for object types, which is a shame in my case because I want to use it on arrays in columns. Instead I just used iter_rows and created a new df. – wordsforthewise Sep 02 '23 at 02:28