Alpha factors sometimes need a cross-sectional (row-wise) rank, like this:

import pandas as pd

df = pd.DataFrame(some_data)

df.rank(axis=1, pct=True) 

How can I implement this efficiently with polars?

ritchie46

1 Answer

A polars DataFrame has the following properties:

  • columns consist of homogeneous data (i.e. every column has a single type).
  • rows consist of heterogeneous data (i.e. data types within a row may differ).

For this reason polars does not want the axis=1 API from pandas. It does not make much sense to compute the rank across numeric, string, boolean, or even more complex nested types like structs and lists.

Pandas solves this by giving you a numeric_only keyword argument.

Polars is more opinionated and nudges you toward using the expression API.

Expression

Polars expressions work on columns, which are guaranteed to consist of homogeneous data. Rows in a DataFrame carry no such guarantee. Luckily, there is a data type that guarantees the values within each row are homogeneous: the pl.List data type.

Let's say we have the following data:

import polars as pl

grades = pl.DataFrame({
    "student": ["bas", "laura", "tim", "jenny"],
    "arithmetic": [10, 5, 6, 8],
    "biology": [4, 6, 2, 7],
    "geography": [8, 4, 9, 7]
})
print(grades)
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ ---     ┆ ---        ┆ ---     ┆ ---       │
│ str     ┆ i64        ┆ i64     ┆ i64       │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         │
└─────────┴────────────┴─────────┴───────────┘

If we want to compute the rank across all the columns except student, we can first collect those columns into a single list column:

grades.select([
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
])
shape: (4, 1)
┌────────────┐
│ all_grades │
│ ---        │
│ list [i64] │
╞════════════╡
│ [10, 4, 8] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6, 4]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [6, 2, 9]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 7, 7]  │
└────────────┘

Running polars expressions on list elements

We can run any polars expression on the elements of a list with the arr.eval expression. These expressions run entirely on polars' query engine and can run in parallel, so they are fast.

Polars doesn't provide a keyword argument to compute the percentages of the ranks. But because expressions are so versatile, we can create our own percentage rank expression.
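To make the arithmetic concrete, here is a small plain-Python sketch of what such a percentage rank computes for a single row: each value's descending rank divided by the number of values in the row. It assumes tied values share the average of the positions they occupy, which is the default tie handling of rank. The helper name pct_rank_desc is hypothetical and exists only for illustration; it is not part of polars.

```python
def pct_rank_desc(values):
    """Descending average rank of each value, divided by the number of values."""
    n = len(values)
    result = []
    for v in values:
        # How many values are strictly larger, and how many are tied with v
        # (the tie count includes v itself).
        greater = sum(1 for other in values if other > v)
        equal = sum(1 for other in values if other == v)
        # Tied values occupy positions greater+1 .. greater+equal; use their mean.
        avg_rank = greater + (equal + 1) / 2
        result.append(avg_rank / n)
    return result


print(pct_rank_desc([10, 4, 8]))  # roughly [0.333, 1.0, 0.667]
print(pct_rank_desc([8, 7, 7]))   # roughly [0.333, 0.833, 0.833]
```

The largest value in a row gets the smallest fraction (1 / n), and the two tied 7s in the second row each get 2.5 / 3.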

Note that when we apply expressions over list elements, we must select the list's element from the context. Any col()/first() selection suffices.

# the percentage rank expression
rank_pct = pl.col("").rank(reverse=True) / pl.col("").count()


grades.with_column(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_grades"),
    # compute the rank by calling `arr.eval`
    pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])

This outputs:

shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank                    │
│ ---     ┆ ---        ┆ ---     ┆ ---       ┆ ---                            │
│ str     ┆ i64        ┆ i64     ┆ i64       ┆ list [f32]                     │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         ┆ [0.333333, 1.0, 0.666667]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         ┆ [0.666667, 0.333333, 1.0]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         ┆ [0.666667, 1.0, 0.333333]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘

Note that this approach works for any expression or operation you want to apply row-wise.

ritchie46