I'm trying to calculate a growing percentile for a column in my Polars DataFrame. The goal is to calculate the percentile from the beginning of the column up until the current observation.

Example Data:

import polars as pl

data = pl.DataFrame({
    "premia_pct": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
})

I want to create a new column "premia_percentile" that calculates the percentile of "premia_pct" from the start of the column up to the current row.

I tried using the cumulative_eval function in Polars as follows:

df = data.with_columns(
    pl.col("premia_pct").cumulative_eval(
        lambda group: (group.rank() / group.len()).last(),
        min_periods=1
    ).alias("premia_percentile")
)

However, I get the following error: AttributeError: 'function' object has no attribute '_pyexpr'

I have also tried a for loop:

        for i in range(1, data.shape[0] + 1):
            data["premia_percentile"][:i] = data["premia_pct"][:i].rank() / i

        return data

but this is not how Polars is supposed to be used, and it doesn't work either, even if I use pl.slice(1, i) instead of [:i].

Maybe there is something similar to pandas' expanding() that I could use?

This pandas-based function produces the output I expect:

    def _calculate_growing_percentile(
        self, data, column_name: str = "premia_percentile"
    ):
        """
        Calculate a growing percentile of a column in a DataFrame.

        Parameters:
        data (pl.DataFrame): The DataFrame containing "premia_pct".
        column_name (str): The name of the new percentile column.

        Returns:
        pl.DataFrame: The DataFrame with the new column.
        """
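        # Convert to a pandas DataFrame to use expanding()
        data = data.to_pandas()

        # Rank each value within the window from the first row up to
        # the current row, expressed as a fraction of the window size
        data[column_name] = data["premia_pct"].expanding().rank(pct=True)

        # Convert back to Polars
        data = pl.from_pandas(data)
        return data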

Is there a way to calculate a growing percentile in Polars using cumulative_eval or any other function? Any help would be greatly appreciated.

Here is a similar post: SO Question

JJ Fantini
  • As for your first error, `.cumulative_eval()` takes an expression, not a function - so it would be `.cumulative_eval((pl.element().rank() / pl.element().len()).last(), min_periods=1)` - I'm not sure if this generates the result you expect though. It may help others if you also include the expected output. – jqurious Jul 21 '23 at 12:18
  • Okay, thanks! I will try it out... – JJ Fantini Jul 21 '23 at 13:40
  • It worked! Seems as though pl.element() is the same as an iteration through the column. Good to know :) – JJ Fantini Jul 21 '23 at 13:46

1 Answer

Here is an equivalent function to the pandas implementation suggested above:

rank_calc = (pl.element().rank() / pl.element().len()).last()
df = data.with_columns(
    pl.col("premia_pct").cumulative_eval(rank_calc, min_periods=1).alias("premia_percentile")
)

The percentile rank calculation is:

(pl.element().rank() / pl.element().len()).last()
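
To make it clearer what that expression computes for each row, here is a minimal plain-Python sketch of the same idea, with no Polars or pandas involved. The function name expanding_percentile_rank is hypothetical, and the sketch assumes ties are resolved with the average rank, which is an assumption about the ranking method in use rather than something stated in the question.

```python
def expanding_percentile_rank(values):
    """Hypothetical helper: for each position i, rank values[i] within
    values[:i + 1] and divide by the window length.

    Assumes ties receive the average of the ranks they span.
    """
    result = []
    for i, current in enumerate(values):
        window = values[: i + 1]
        smaller = sum(1 for v in window if v < current)
        equal = sum(1 for v in window if v == current)
        # Tied values occupy ranks smaller+1 .. smaller+equal;
        # their average is smaller + (equal + 1) / 2.
        avg_rank = smaller + (equal + 1) / 2
        result.append(avg_rank / len(window))
    return result


premia_pct = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
percentiles = expanding_percentile_rank(premia_pct)
```

Because the example column is strictly increasing, each new value is the largest seen so far, so every entry in percentiles is 1.0. A non-monotonic input such as [3, 1, 2, 2] gives more varied values.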
JJ Fantini