I have a largish dataframe (5.5M rows, four columns). The first column (let's call it column A) has 235 distinct entries. The second column (B) has 100 distinct entries, integers from 0 to 99, all present in various proportions for each entry in A. I group by A and aggregate by randomly selecting a value from B. Something like
df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
)
In doing so, every value in A is attributed a random integer. My goal is to select, from the dataframe, the columns C and D corresponding to the (A, B) pairs so generated.
My approach so far is
import functools
import operator
import numpy as np
import polars as pl

choice = df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
).to_numpy()
lchoice = ((df["A"] == arr[0]) & (df["B"] == arr[1]) for arr in choice)
mask = functools.reduce(operator.or_, lchoice)
sdf = df[mask].select(["C", "D"])
It does the job, but does not feel very idiomatic.
My first attempt was
sdf = df.filter(
    functools.reduce(
        operator.or_,
        [(pl.col("A") == arr[0]) & (pl.col("B") == arr[1]) for arr in choice],
    )
)
but it hangs until I kill it (I waited for ~30 minutes, while the other approach takes 1.6 seconds).
df.filter(
    (pl.col("A") == choice[0, 0]) & (pl.col("B") == choice[0, 1])
)
works fine, as expected, and I have successfully used the functools.reduce construct as an argument to filter in the past. Obviously, I do not want to write all 235 conditions by hand; I could loop over the rows of choice, filter df one pair at a time, and then concatenate the resulting dataframes (sketched below), but that sounds much more expensive than it should be.
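For reference, a minimal sketch of that loop-and-concat variant, reusing the choice array from above (my own illustration, not benchmarked):

parts = [
    df.filter((pl.col("A") == a) & (pl.col("B") == b)).select(["C", "D"])
    for a, b in choice
]
sdf = pl.concat(parts)  # one small filter per (A, B) pair, stitched back together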
Any tip on getting to my sdf "the polars way", without having to create temporary objects, arrays, etc.? As I said, I have a working solution, but it is kind of shaky, and I am interested in learning better polars.
EDIT: some mock data
df = pl.DataFrame({"A": [1.3, 8.9, 6.7]*3 + [3.6, 4.1]*2,
                   "B": [1]*3 + [2]*3 + [3]*3 + [1]*2 + [2]*2,
                   "C": [21.5, 24.3, 21.8, 20.8, 23.6, 15.6, 23.5,
                         16.1, 15.6, 14.8, 14.7, 23.8, 20.],
                   "D": [6.9, 7.6, 6.4, 6.2, 7.6, 6.2, 6.3,
                         7.1, 7.8, 7.7, 6.5, 6.6, 7.1]})
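For concreteness, the intermediate choice array of the working solution, evaluated on this mock data, has one row per distinct value of A (the drawn B values differ from run to run):

choice = df.groupby("A").agg(
    pl.col("B").unique().apply(np.random.choice)
).to_numpy()
# shape (5, 2): column 0 holds the distinct A values (1.3, 8.9, 6.7,
# 3.6, 4.1), column 1 a B drawn from that A's unique values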
Slight twist on the accepted answer:
seed = 17  # any fixed value, kept identical across runs

df.sort(by=['A', 'B'], in_place=True)
sdf = (df
       .join(df
             .groupby('A', maintain_order=True)
             .agg(pl.col('B')
                  .unique()
                  .sort()
                  .shuffle(seed)
                  .first()
                  .alias('B')),
             on=['A', 'B'])
       .select(['C', 'D']))
I need to perform this operation multiple times, and I'd like to ensure reproducibility of the random generation, hence the sorts and maintain_order=True. A reusable wrapper along these lines is sketched below.
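Since the draw is repeated with different seeds, one natural packaging is a small helper; this is my own sketch (the name pick_cd and the example seed are hypothetical), everything else mirrors the snippet above:

def pick_cd(frame: pl.DataFrame, seed: int) -> pl.DataFrame:
    # one reproducible (A, B) pair per distinct A, then the matching C/D rows
    frame = frame.sort(by=["A", "B"])
    picks = frame.groupby("A", maintain_order=True).agg(
        pl.col("B").unique().sort().shuffle(seed).first().alias("B")
    )
    return frame.join(picks, on=["A", "B"]).select(["C", "D"])

sdf = pick_cd(df, seed=17)  # same seed -> same selection; new seed, new draw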