Compare two Pola-rs dataframes by position

Question

Suppose I have two dataframes like:

use polars::prelude::*;

let df_1 = df! {
    "1" => [1, 2, 2, 3, 4, 3],
    "2" => [1, 4, 2, 3, 4, 3],
    "3" => [1, 2, 6, 3, 4, 3],
}
.unwrap();

// df_2 holds, per column, the running maximum of all previous rows (0 for the first row).
let mut df_2 = df_1.clone();
for idx in 0..df_2.width() {
    df_2.apply_at_idx(idx, |s| {
        s.cummax(false)
            .shift(1)
            .fill_null(FillNullStrategy::Zero)
            .unwrap()
    })
    .unwrap();
}

println!("{:#?}", df_1);
println!("{:#?}", df_2);

shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 2   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘

and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:

shape: (6, 3)
┌───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘

In Python Pandas I might do df_1.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?

AOC 2022 Day 8? :D I feel your pain. I did it with `ndarray` — cadolphs, Dec 09 '22 at 04:19
Anyway, you're only summing over bool arrays to count how many values are true. Can't you just use `somethingsomething.iter().filter(|x| x == true).count()`? — cadolphs, Dec 09 '22 at 04:26
Not quite, I'm summing integer values where the corresponding bool dataframe is true. I guess I could convert that and iterate over but at that point there's not much benefit in using the dataframe in the first place? — baarkerlounger, Dec 09 '22 at 07:28

score 1 · Answer 1 · edited Dec 11 '22 at 21:22

It took me a long time to work out how to do even elementwise addition in polars. I guess that's just not the "normal" way to use these things, since in principle the columns can have different data types.

You can't call zip and map on the dataframe directly.

However, df has a method iter() that gives you an iterator over all the columns. The columns are Series, and those implement all sorts of elementwise operations.

Long story short:

let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();

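// Compare the frames column by column; each comparison yields a boolean mask.
let df3 = DataFrame::new(
    df.iter()
        .zip(df2.iter())
        .map(|(series1, series2)| series1.gt(series2).unwrap().into_series())
        .collect::<Vec<Series>>(),
)
.unwrap();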

That gives you your boolean dataframe. From here it should be possible to use it for indexing, probably with another df.iter().zip(df3.iter()).
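To make the zip-over-columns idea concrete, here is a minimal sketch in plain Rust, with no polars involved. It assumes each dataframe is stored as a Vec of i32 columns; gt_mask and masked_sum are hypothetical helper names made up for this illustration, not polars functions. Note that zip stops at the shorter input, so columns of different lengths would be truncated silently.

```rust
/// Compare two column-major tables position by position (left > right).
/// Hypothetical helper, not part of polars.
fn gt_mask(left: &[Vec<i32>], right: &[Vec<i32>]) -> Vec<Vec<bool>> {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| l.iter().zip(r.iter()).map(|(a, b)| a > b).collect())
        .collect()
}

/// Sum every value whose mask entry at the same position is true.
/// Hypothetical helper, not part of polars.
fn masked_sum(values: &[Vec<i32>], mask: &[Vec<bool>]) -> i32 {
    values
        .iter()
        .zip(mask.iter())
        .flat_map(|(col, m)| col.iter().zip(m.iter()))
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .sum()
}

fn main() {
    // Same values as df and df2 above: columns "A" and "B".
    let df = vec![vec![1, 2, 3], vec![4, 5, 6]];
    let df2 = vec![vec![6, 5, 4], vec![3, 2, 1]];

    let mask = gt_mask(&df, &df2);
    println!("{:?}", mask);

    // Only column "B" passes the comparison, so the sum is 4 + 5 + 6.
    println!("{}", masked_sum(&df, &mask));
}
```

With polars Series the same shape applies: zip the columns of the value frame with the columns of the mask frame and filter or aggregate each pair.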

It seems the Python API does have these methods, e.g. df_1 > df_2 https://github.com/pola-rs/polars/blob/7d26e6c0a8ce105d11129385f7dce09d3df0faa3/py-polars/polars/internals/dataframe/frame.py#L996 though they loop over the series in a similar way to your answer under the hood and don't seem to exist in the Rust API. Looks like this is the only way currently. — baarkerlounger, Dec 12 '22 at 08:44

jqurious · Accepted Answer · 2022-12-13T21:54:11.310

<edit>

If you actually have a single dataframe you can do:

let mask = 
    when(all().gt_eq(
            all().cummax(false).shift(1).fill_null(0)))
    .then(all())
    .otherwise(lit(NULL));

let out = 
    df_1.lazy().select(&[mask])
    //.sum()
    .collect();

</edit>
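
As a sanity check on what that expression computes per column, here is a plain-Rust sketch (no polars) of cummax, then shift(1), then fill_null(0) for a single i32 column. The function name shifted_cummax is an assumption made up for this illustration.

```rust
/// Plain-Rust equivalent of `cummax(false).shift(1).fill_null(0)` for one
/// i32 column: each row gets the maximum of all earlier rows, and the first
/// row (which has no earlier rows) gets 0. Hypothetical helper name.
fn shifted_cummax(col: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(col.len());
    let mut running_max: Option<i32> = None;
    for &v in col {
        // Emit the maximum seen before this row, then update it.
        out.push(running_max.unwrap_or(0));
        running_max = Some(running_max.map_or(v, |m| m.max(v)));
    }
    out
}

fn main() {
    let col = [1, 4, 2, 3, 4, 3]; // column "2" of df_1
    let prev_max = shifted_cummax(&col);
    println!("{:?}", prev_max); // should match column "2" of df_2

    // Keep a value only where it is >= the maximum seen before it,
    // mirroring when(...).then(all()).otherwise(lit(NULL)).
    let kept: Vec<Option<i32>> = col
        .iter()
        .zip(&prev_max)
        .map(|(&v, &m)| if v >= m { Some(v) } else { None })
        .collect();
    println!("{:?}", kept);
}
```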

From https://stackoverflow.com/a/72899438:

"Masking out values by columns in another DataFrame is a potential for errors caused by different lengths. For this reason polars does not encourage such operations."

It appears the recommended way is to add a suffix to one of the dataframes, "concat" them and use when/then/otherwise.

Since that answer, .with_context() has been added, which can be used to access both dataframes.

In Python:

df1.lazy().with_context(
   df2.lazy().select(pl.all().suffix("_right"))
).select([
   pl.when(pl.col(name) >= pl.col(f"{name}_right"))
     .then(pl.col(name)) 
   for name in df1.columns
]).collect()

I've not used Rust, but here is my attempt at a translation:

let mask = 
   df_1.get_column_names().iter().map(|name| 
      when(col(name).gt_eq(col(&format!("{name}_right"))))
      .then(col(name))
      .otherwise(lit(NULL))
   ).collect::<Vec<Expr>>();

let out = 
    df_1.lazy()
    .with_context(&[
        df_2.lazy().select(&[all().suffix("_right")])
    ])
    .select(&mask)
    //.sum()
    .collect();

println!("{:#?}", out);

Output:

Ok(shape: (6, 3)
┌──────┬──────┬──────┐
│ 1    ┆ 2    ┆ 3    │
│ ---  ┆ ---  ┆ ---  │
│ i32  ┆ i32  ┆ i32  │
╞══════╪══════╪══════╡
│ 1    ┆ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 4    ┆ 2    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ null ┆ 6    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4    ┆ 4    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘)
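
To check by hand what uncommenting .sum() should give, here is a plain-Rust sketch (no polars) that rebuilds the table above with None standing in for null and sums each column. It assumes the aggregation skips nulls; keep_if_ge and masked_column_sums are hypothetical names used only for this illustration.

```rust
/// Keep left[i] where left[i] >= right[i], otherwise None (standing in for null).
/// Hypothetical helper, not part of polars.
fn keep_if_ge(left: &[i32], right: &[i32]) -> Vec<Option<i32>> {
    left.iter()
        .zip(right)
        .map(|(&l, &r)| if l >= r { Some(l) } else { None })
        .collect()
}

/// Per-column sum of the kept values, skipping None entries.
/// Hypothetical helper, not part of polars.
fn masked_column_sums(left: &[Vec<i32>], right: &[Vec<i32>]) -> Vec<i32> {
    left.iter()
        .zip(right)
        .map(|(l, r)| keep_if_ge(l, r).iter().flatten().sum())
        .collect()
}

fn main() {
    // Columns "1", "2", "3" of df_1 and df_2 from the question.
    let df_1 = [
        vec![1, 2, 2, 3, 4, 3],
        vec![1, 4, 2, 3, 4, 3],
        vec![1, 2, 6, 3, 4, 3],
    ];
    let df_2 = [
        vec![0, 1, 2, 2, 3, 4],
        vec![0, 1, 4, 4, 4, 4],
        vec![0, 1, 2, 6, 6, 6],
    ];

    let sums = masked_column_sums(&df_1, &df_2);
    let total: i32 = sums.iter().sum();
    println!("{:?} total {}", sums, total);
}
```

The grand total here corresponds to the Pandas .sum().sum() in the question.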
