0

[enter image description here]

1

I am trying to add a column (column C) to my polars dataframe that counts how many times a value of one of the dataframe's columns (column A) is greater/less than the value of another column (column B). Once the value turns from less/greater to greater/less the cumulative sum should reset and start counting from 1/-1 again.

eshirvana
  • 23,227
  • 3
  • 22
  • 38
bruppfab
  • 3
  • 2

2 Answers2

1

The data

I'm going to change the data in the example you provided.

df = pl.DataFrame(
    {
        "a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
        "b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
    }
)
print(df)
shape: (13, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 11  ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10  ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10  ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10  ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 9   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15  ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15  ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15  ┆ 15  │
└─────┴─────┘

Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)

The algorithm

df = (
    df
    .with_column((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
    .with_column(
        pl.when(pl.col("sign_a_minus_b") == 0)
        .then(None)
        .otherwise(pl.col("sign_a_minus_b"))
        .forward_fill()
        .alias("run_type")
    )
    .with_column(
        (pl.col("run_type") != pl.col("run_type").shift_and_fill(1, 0))
        .cumsum()
        .alias("run_id")
    )
    .with_column(pl.col("sign_a_minus_b").cumsum().over("run_id").alias("result"))
)
print(df)
shape: (13, 6)
┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
│ a   ┆ b   ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
│ --- ┆ --- ┆ ---            ┆ ---      ┆ ---    ┆ ---    │
│ i64 ┆ i64 ┆ i64            ┆ i64      ┆ u32    ┆ i64    │
╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
│ 11  ┆ 11  ┆ 0              ┆ null     ┆ 1      ┆ 0      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 2      ┆ 1      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 2      ┆ 2      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 2      ┆ 3      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 9   ┆ 9   ┆ 0              ┆ 1        ┆ 2      ┆ 3      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8   ┆ 9   ┆ -1             ┆ -1       ┆ 3      ┆ -1     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8   ┆ 10  ┆ -1             ┆ -1       ┆ 3      ┆ -2     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8   ┆ 8   ┆ 0              ┆ -1       ┆ 3      ┆ -2     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8   ┆ 8   ┆ 0              ┆ -1       ┆ 3      ┆ -2     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8   ┆ 10  ┆ -1             ┆ -1       ┆ 3      ┆ -3     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15  ┆ 11  ┆ 1              ┆ 1        ┆ 4      ┆ 1      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15  ┆ 11  ┆ 1              ┆ 1        ┆ 4      ┆ 2      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15  ┆ 15  ┆ 0              ┆ 1        ┆ 4      ┆ 2      │
└─────┴─────┴────────────────┴──────────┴────────┴────────┘

I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)

The basic idea is to calculate a run_id for each run of positive or negative values. We will then use the cumsum function and the over windowing expression to create a running count of positives/negatives over each run_id.

Key assumption: ties in columns a and b do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.

sign_a_minus_b does two things: it identifies whether a run is positive/negative, and whether there is a tie in columns a and b.

run_type extends any run to include any cases where a tie occurs in columns a and b. The null value at the top of the column was intended - it shows what happens when a tie occurs in the first row.

result is the output column. Note that tied columns do not interrupt a run, but they don't contribute to the totals for that run.

One final note: if ties in columns a and b are not allowed, then this algorithm can be simplified ... and run faster.

0

Not very elegant or Pythonic, but something like the below should work:

import pandas as pd

df = pd.DataFrame({'a': [10, 10, 10, 8, 8, 8, 15, 15]
,'b': [9, 9, 9, 9, 10, 10, 11, 11]})

df['c'] = df.apply(lambda row: 1 if row['a'] > row['b'] else 0, axis=1)
df['d'] = df.apply(lambda row: 0 if row['a'] > row['b'] else -1, axis=1)
for i in range(1, len(df)):
    if df.loc[i, 'a'] > df.loc[i, 'b']: 
        df.loc[i, 'c'] = df.loc[i-1, 'c']  + 1
        df.loc[i, 'd'] = 0
    else:
        df.loc[i, 'd'] = df.loc[i-1, 'd']  - 1
        df.loc[i, 'c'] = 0
        
df['ans'] = df['c'] + df['d']
print(df)

Also you might need to think about what the value should be for the specific case when column a and b are equal.

aquaplane
  • 127
  • 4
  • Thanks very much for this but I already have a working solution in pandas which looks pretty much like the one above. Unfortunately, it is pretty slow and the entire process takes ages which is why i am transferring the whole process to polars and am trying to get rid of pandas. So, I am hoping someone else could come up with a polars version that is more elegant. – bruppfab May 05 '22 at 15:08