Randomly drop % of rows by condition in polars

Question

Imagine we have the following polars dataframe:

Now using polars we want to drop every row with Labels == 0 with 50% probability. An example output would be the following:

I think filter and sample might be handy... I have something but it is not working:

df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5))

How can I make it work?

Does `df.filter(pl.col("Labels") == 0).sample(frac=0.5)` work for you? If so, use `df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5).index)` – jezrael, Sep 06 '22 at 05:32
@jezrael I tried the pandas way, but I get `AttributeError: 'DataFrame' object has no attribute 'index'` – Janikas, Sep 06 '22 at 05:34
OK, can you try `df.filter(pl.col("Labels") == 0).sample(frac=0.5).vstack(df.filter(pl.col("Labels") != 0))` – jezrael, Sep 06 '22 at 05:40
@jezrael thank you! That approach does work, however I would like the data to still be shuffled. Is there any way to avoid segregating `Labels==0` from `Labels!=0`? – Janikas, Sep 06 '22 at 05:55
@Janikas - Can you test the edited answer? I added a row count column and then sort by it after `vstack` – jezrael, Sep 06 '22 at 06:16

jezrael · Answer 1 · 2022-09-06T06:27:30.097


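# Note: newer polars releases name this parameter `fraction` instead of `frac`.
df = (df.filter(pl.col("Labels") == 0).sample(frac=0.5)  # keep a random half of the Labels == 0 rows
        .vstack(df.filter(pl.col("Labels") != 0))         # append every row with Labels != 0
        .sample(frac=1, shuffle=True))                    # reshuffle so the two groups are mixed again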

edited Sep 06 '22 at 06:27

answered Sep 06 '22 at 05:57

jezrael

Is the `sort` used for shuffle? – Janikas Sep 06 '22 at 06:20
@Janikas - no, you need `sample` for that; the answer was edited. – jezrael Sep 06 '22 at 06:27
That's exactly how I shuffled it! Thank you my friend! :) – Janikas Sep 06 '22 at 06:28
