Imagine we have the following polars dataframe:

Feature 1  Feature 2  Labels
100        25         1
150        18         0
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1
180        25         0

Now, using polars, we want to drop each row where Labels == 0 with 50% probability. An example output would be the following:

Feature 1  Feature 2  Labels
100        25         1
200        15         0
230        28         0
120        12         1
130        34         1
150        23         1

I think filter and sample might be handy. I have the following attempt, but it is not working:

df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5))

How can I make it work?

Janikas
  • Does `df.filter(pl.col("Labels") == 0).sample(frac=0.5)` work well for you? If so, use `df = df.drop(df.filter(pl.col("Labels") == 0).sample(frac=0.5).index)` – jezrael Sep 06 '22 at 05:32
  • @jezrael I tried the pandas way, but I get `AttributeError: 'DataFrame' object has no attribute 'index'` – Janikas Sep 06 '22 at 05:34
  • OK, can you try `df.filter(pl.col("Labels") == 0).sample(frac=0.5).vstack(df.filter(pl.col("Labels") != 0))` – jezrael Sep 06 '22 at 05:40
  • `df.drop(df[df['Labels']==0].sample(frac=0.5).index)` – Mehdi Khademloo Sep 06 '22 at 05:42
  • @jezrael thank you! That approach does work; however, I would like the data to still be shuffled. Is there any way to avoid segregating `Labels==0` from `Labels!=0`? – Janikas Sep 06 '22 at 05:55
  • Okay I found shuffle from polars, nice! :) – Janikas Sep 06 '22 at 05:55
  • @Janikas - Can you test the edited answer? I added a row count column and then sort after `vstack` – jezrael Sep 06 '22 at 06:16

1 Answer

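You can use polars.DataFrame.vstack: sample half of the rows with Labels == 0, stack the rows with Labels != 0 back on, then shuffle the combined frame so the two groups are not segregated: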

df = (df.filter(pl.col("Labels") == 0).sample(frac=0.5)
        .vstack(df.filter(pl.col("Labels") != 0))
        .sample(frac=1, shuffle=True))
jezrael