Suppose we have this dataframe in polars (python):

import polars as pl

df = pl.DataFrame(
    {
        "era": ["01", "01", "02", "02", "03", "03"],
        "pred": [3, 5, 6, 8, 9, 1],
    }
)

I can create a rank/row_number based on one column, like:

df.with_columns(rn = pl.col("era").rank("ordinal"))

But if I try to rank on two columns at once, it does not work:

df.with_columns(rn = pl.col(["era","pred"]).rank("ordinal"))

I get this error message:

ComputeError: The name: 'rn' passed to `LazyFrame.with_columns` is duplicate
Error originated just after this operation:
DF ["era", "pred"]; PROJECT */2 COLUMNS; SELECTION: "None"

Any suggestions on how to do this?

Ynjxsjmh
lmocsi

1 Answer

Passing multiple column names to a single pl.col call:

pl.col("one", "two").function()

is essentially shorthand for:

pl.col("one").function(), pl.col("two").function()

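With rn = ..., both resulting expressions are named rn, which is why with_columns reports a duplicate name. You can use pl.struct to combine multiple columns into a single struct column instead: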

>>> df.with_columns(rn = pl.struct("era", "pred"))
shape: (6, 3)
┌─────┬──────┬───────────┐
│ era ┆ pred ┆ rn        │
│ --- ┆ ---  ┆ ---       │
│ str ┆ i64  ┆ struct[2] │
╞═════╪══════╪═══════════╡
│ 01  ┆ 3    ┆ {"01",3}  │
│ 01  ┆ 5    ┆ {"01",5}  │
│ 02  ┆ 6    ┆ {"02",6}  │
│ 02  ┆ 8    ┆ {"02",8}  │
│ 03  ┆ 9    ┆ {"03",9}  │
│ 03  ┆ 1    ┆ {"03",1}  │
└─────┴──────┴───────────┘

You can then call .rank on the struct, which orders rows by "era" first and by "pred" within each "era":

>>> df.with_columns(rn = pl.struct("era", "pred").rank("ordinal"))
shape: (6, 3)
┌─────┬──────┬─────┐
│ era ┆ pred ┆ rn  │
│ --- ┆ ---  ┆ --- │
│ str ┆ i64  ┆ u32 │
╞═════╪══════╪═════╡
│ 01  ┆ 3    ┆ 1   │
│ 01  ┆ 5    ┆ 2   │
│ 02  ┆ 6    ┆ 3   │
│ 02  ┆ 8    ┆ 4   │
│ 03  ┆ 9    ┆ 6   │
│ 03  ┆ 1    ┆ 5   │
└─────┴──────┴─────┘
jqurious
  • And what if you want this rank to be by "era" ascending and by "pred" descending? – lmocsi Apr 25 '23 at 13:59
  • Multiply it by -1. So instead of `pl.struct("era", "pred")`, use `pl.struct("era", pl.col("pred")*-1)`. If the underlying column isn't numeric, rank it first, i.e. `pl.struct("era", pl.col("pred").rank()*-1)` – Dean MacGregor Apr 25 '23 at 14:29