I'm having a hard time understanding how to use the lazy with_context. The docs say:

This allows expressions to also access columns from DataFrames that are not part of this one.

I want to filter a column based on a column from another frame, but I get an error stating that the column from the other frame does not exist. I'm not sure what I'm doing wrong, since this seems to be exactly the use case the with_context docs describe.

main.rs

use polars::prelude::*;

fn main() {
    let df0 = df! {
        "id" => [1, 2, 3],
        "name" => ["foo", "bar", "baz"],
    }
    .unwrap()
    .lazy();
    let other_df = df! {
        "other_id" => [1,2,2,1],
        "name" => ["w", "x", "y", "z"],
    }
    .unwrap()
    .lazy();

    let lf = df0.with_context(&[other_df]);
    let res = lf
        .filter(col("id").is_in(col("other_id")))
        .collect()
        .unwrap();
    println!("{:?}", res);
}

Cargo.toml

[dependencies]
polars = {git = "https://github.com/pola-rs/polars", branch = "master", features = ["lazy", "is_in"]}

Edit: If I use select instead of filter, I don't get an error.

let res = lf
    .select(&[col("id").is_in(col("other_id"))])
    .collect()
    .unwrap();
Cory Grinstead

1 Answer

I ran into this problem today. The workaround I found was to abuse column remapping (an alias) to force the is_in expression to be saved as a boolean mask column, which does explicitly what the select statement above does implicitly. I don't use the Rust version (only the Python one at the moment), but I looked up the Rust API and it seems pretty similar, though I don't know whether it is updated as often as the Python or JavaScript ones.

This seems to be specific to how with_context interacts with .filter(), and I'm not entirely sure why. The best and most explicit workaround I found is to use .with_columns(), which doesn't seem to have the same issue, at least in the version of Polars I'm currently using (py-polars 0.16.2).

When using with_context, it's easier if the column from the context frame has a name that differs from the columns of the main frame. Such a column can be generated on the fly using .select(), col(), or all() with a chained .suffix() or .prefix() call. If you look at the LazyFrame resulting from any with_context call, you can see that with_context attaches the context columns to the lzf.columns structure.

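import polars as pl
...
...
# Wrap the chain in parentheses so it can span multiple lines.
lzf_filtered = (
    lzf
    .with_context(filtering_lzf)
    # Save the is_in result as an explicit boolean mask column.
    .with_columns([
        pl.col(col_to_filter)
        .is_in(pl.col(filtering_col))
        .alias("some_col_name")
    ])
    # Filter on the mask column instead of the cross-frame expression.
    .filter(pl.col("some_col_name"))
)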

You can then select or drop the additional columns however you choose. My advice would be not to stack contexts, because it can get really messy over time. I usually just use regex filtering in pl.col() if I've stacked up columns I don't want, or explicit filtering if I'm pruning a data set.
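
To make it concrete what the mask column ends up holding, here is a plain-Python model of the two steps, using the data from the question. It deliberately does not call Polars, so it says nothing about why .filter() fails with with_context; the variable names are made up for illustration.

```python
# Plain-Python model of the mask-then-filter workaround (no Polars involved).
# Data copied from the question; variable names are illustrative only.
ids = [1, 2, 3]
names = ["foo", "bar", "baz"]
other_ids = [1, 2, 2, 1]

# Step 1: the is_in check, stored as its own boolean column.
lookup = set(other_ids)
mask = [i in lookup for i in ids]

# Step 2: filter on the stored mask rather than on the cross-frame expression.
filtered = [(i, n) for i, n, keep in zip(ids, names, mask) if keep]

print(mask)      # [True, True, False]
print(filtered)  # [(1, 'foo'), (2, 'bar')]
```

The mask has one entry per row of the main frame (three here), even though the other frame has four rows, which is why it can be attached as a regular column and filtered on afterwards.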

While I appreciate that this isn't in exactly the same language, the stylistic differences between the Rust API and the Python one are fairly minimal. I'm not going to screw it up by actually writing Rust when I don't really use it, but given that they are basically using the same code base, the same approach should carry over.

Hopefully that helps someone else stuck with the same problem.