
Let's say I have a Polars dataframe like so:

import polars as pl

df = pl.DataFrame({
    'a': [0.3, 0.7, 0.5, 0.1, 0.9]
})

And now I need to add a new column where 1 or 0 is assigned depending on whether a value in column 'a' is greater or less than some threshold. In Pandas I can do this:

import numpy as np

THRESHOLD = 0.5
df['new'] = np.where(df.a > THRESHOLD, 0, 1)

I can also do something very similar in Polars:

df = df.with_columns(
    pl.lit(np.where(df.select('a').to_numpy() > THRESHOLD, 0, 1).ravel())
    .alias('new')
)

This works fine but I'm sure that using NumPy here is not the best practice.

I've also tried something more like:

df = df.with_columns(
    pl.lit(df.filter(pl.col('a') > THRESHOLD).select([0, 1]))
    .alias('new')
)

But with this syntax I keep running into the following error:

DuplicateError                            Traceback (most recent call last)
Cell In[47], line 5
      1 THRESHOLD = 0.5
      2 DELAY_TOLERANCE = 10
      4 df = df.with_columns(
----> 5     pl.lit(df.filter(pl.col('a') > THRESHOLD).select([0, 1]))
      6     .alias('new')
      7 )
      8 df.head()

DuplicateError: column with name 'literal' has more than one occurrences

So my question is two-fold: what am I doing wrong here and what is the best practice in Polars for such conditional assignments?

I did look through the docs and previous questions but couldn't find anything resembling my use case.

NotAName

1 Answer


The select([0, 1]) doesn't make much sense in Polars: you're just selecting two literals. I'm not quite sure why that throws a DuplicateError as is.

Conditionals in Polars are best expressed with pl.when:

df.with_columns(pl.when(pl.col("a") > 0.5).then(0).otherwise(1).alias("b"))

Wayoshi