0

I am trying to understand if there is any way to do when..then..otherwise in polars and assign to multiple columns. I have a elo dataset with millions of rows where I want to assign the current elo to anything greater than date. In pandas, I would do

elo_df.loc[(id, date:)), ["elo", "true_skill_mu", "true_skill_sigma"]] = elo, true_skill_mu, true_skill_sigma

The code below works but is very slow. I am hoping I can increase the speed by at least 3x by making the filter happen once. Also, if have any suggestion on to how tomake this faster, please let me know.

elo_df = elo_df.with_columns([pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_rating)).otherwise(pl.col("elo")).alias("elo"),
                                           
pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_mu)).otherwise(pl.col("true_skill_mu")).alias("true_skill_mu"),
                                            pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_sigma)).otherwise(pl.col("true_skill_sigma")).alias("true_skill_sigma")]
Michael WS
  • 2,450
  • 4
  • 24
  • 46
  • Since they are in a `with_columns` context, the three when/then/otherwise expressions will run in parallel (as long as your CPU has at least 3 cores). So from a wall-clock standpoint, you will not gain much by trying to rewrite them as one filter. That said, are you updating large batches of id's at one time? If so, then there is a speed-up for that. –  Jul 24 '22 at 19:34
  • I am updating one id at a time – Michael WS Jul 24 '22 at 19:44
  • @cbilot Is there a faster way to update than these filters? – Michael WS Jul 24 '22 at 22:26
  • Hmm, I'm somewhat puzzled. I created a dataset of 605 million records, and ran the three when/then/otherwise expressions are you described. I'm getting times of about 3-4 seconds. Are you getting something significantly worse? –  Jul 24 '22 at 23:21
  • My issue is I am probably doing something silly. I am rolling thru each date and updating all going forward on each date – Michael WS Jul 24 '22 at 23:52
  • If you can post your code along with a mock dataset of about 10 records and a description of what you’re trying to achieve, I might be able to help. –  Jul 24 '22 at 23:58
  • https://gist.github.com/MichaelWS/c3591a048390de568803e5db7d537e4f – Michael WS Jul 25 '22 at 00:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/246722/discussion-between-michael-ws-and-cbilot). – Michael WS Jul 25 '22 at 00:12

1 Answers1

1

use polars.col followed by polars.Expr.map

def on_when(s : pl.Series, df = elo_df):
    # or pass them as default arguments
    global new_rating, new_mu, new_sigma
    if s.name == "elo":
        return new_rating
    elif s.name == "true_skill_mu":
        return new_mu
    elif s.name == "true_skill_sigma":
        return new_sigma


def otherwise(s : pl.Series, df = elo_df):
    if s.name == "elo":
        return df.get_column("elo")
    elif s.name == "true_skill_mu":
        return df.get_column("true_skill_mu")
    elif s.name == "true_skill_sigma":
        return df.get_column("true_skill_sigma")
    # or you can simply just use `return s` since we are not doing any operations
    # this is just an example if we want to do calculations based on other columns


elo_df = elo_df.with_columns([
    pl.when((pl.col("id") == col) & (pl.col("date") >= date))
    .then(pl.col(["elo", "true_skill_mu", "true_skill_sigma"]).map(on_when))
    .otherwise(pl.col(["elo", "true_skill_mu", "true_skill_sigma"]).map(otherwise))
])
David Waterworth
  • 2,214
  • 1
  • 21
  • 41
lost bit
  • 123
  • 1
  • 8