Alpha factors sometimes need a cross-sectional (row-wise) rank, like this:
import pandas as pd
df = pd.DataFrame(some_data)
df.rank(axis=1, pct=True)
How can this be implemented efficiently with polars?
A polars DataFrame is column-oriented: each column holds a single data type, but the types can differ from column to column.
For this reason polars does not want the axis=1 API from pandas. It does not make much sense to compute the rank across numeric, string, boolean, or even more complex nested types like structs and lists.
Pandas solves this by giving you the numeric_only keyword argument.
Polars is more opinionated and wants to nudge you toward using the expression API.
Polars expressions work on columns, which are guaranteed to consist of homogeneous data. The rows of a DataFrame have no such guarantee. Luckily we have a data type that does guarantee homogeneous rows: the pl.List data type.
Let's say we have the following data:
import polars as pl

grades = pl.DataFrame({
    "student": ["bas", "laura", "tim", "jenny"],
    "arithmetic": [10, 5, 6, 8],
    "biology": [4, 6, 2, 7],
    "geography": [8, 4, 9, 7]
})
print(grades)
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ ---     ┆ ---        ┆ ---     ┆ ---       │
│ str     ┆ i64        ┆ i64     ┆ i64       │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         │
└─────────┴────────────┴─────────┴───────────┘
If we want to compute the rank of all the columns except for student, we can collect those into a list data type, which would give:
grades.select([
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
])
shape: (4, 1)
┌────────────┐
│ all_grades │
│ ---        │
│ list [i64] │
╞════════════╡
│ [10, 4, 8] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6, 4]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [6, 2, 9]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 7, 7]  │
└────────────┘
We can run any polars expression on the elements of a list with the arr.eval expression! These expressions run entirely on polars' query engine and can run in parallel, so they will be super fast.
Polars doesn't provide a keyword argument to compute the percentages of the ranks. But because expressions are so versatile, we can create our own percentage rank expression.
Note that when we apply expressions over list elements, we must select the list's element from the context. Any col()/first() selection suffices.
# the percentage rank expression
rank_pct = pl.col("").rank(reverse=True) / pl.col("").count()

grades.with_column(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_grades"),
    # compute the rank by calling `arr.eval`
    pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])
This outputs:
shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank                    │
│ ---     ┆ ---        ┆ ---     ┆ ---       ┆ ---                            │
│ str     ┆ i64        ┆ i64     ┆ i64       ┆ list [f32]                     │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         ┆ [0.333333, 1.0, 0.666667]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         ┆ [0.666667, 0.333333, 1.0]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         ┆ [0.666667, 1.0, 0.333333]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘
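The grades_rank values can be cross-checked without polars. Below is a minimal plain-Python sketch of what the percentage rank expression computes for one row, assuming ties share the average of their ranks (the default rank method in polars). The helper name rank_pct_desc is hypothetical and only serves this illustration.

```python
def rank_pct_desc(values):
    """Descending rank with ties averaged, divided by the number of values."""
    n = len(values)
    # Indices of the values when sorted from largest to smallest.
    order = sorted(range(n), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the run [start, end) over all values tied with order[start].
        end = start
        while end < n and values[order[end]] == values[order[start]]:
            end += 1
        # The 1-based ranks start+1 .. end share their average.
        avg_rank = (start + 1 + end) / 2
        for k in range(start, end):
            ranks[order[k]] = avg_rank
        start = end
    return [r / n for r in ranks]


rows = {
    "bas": [10, 4, 8],
    "laura": [5, 6, 4],
    "tim": [6, 2, 9],
    "jenny": [8, 7, 7],
}
for student, row in rows.items():
    print(student, [round(p, 6) for p in rank_pct_desc(row)])
```

For jenny, the two 7s tie for ranks 2 and 3, so each gets 2.5, and 2.5 / 3 gives the 0.833333 shown in the table.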
Note that this solution works for any expression/operation you want to apply row-wise.