I am building a binary classification model on a heavily imbalanced dataset (95% 1s and 5% 0s). I want to drop the rows that contain outliers, and I used the code below:

import numpy as np
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

However, this code also drops the rows whose label is 0. Is there a better way to drop rows with outliers in all columns except the label column?

dododips

1 Answer

Try this (assuming your label is stored in df["label"]):

df = df[(df["label"] == 0) | (np.abs(stats.zscore(df)) < 3).all(axis=1)]

The first condition keeps every row with df["label"] == 0, regardless of its z-scores. Note that this also keeps label-0 rows that contain outliers in the other columns.
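
For context on why the original filter removes the minority class: with 95% ones, the label column has mean 0.95 and standard deviation sqrt(0.95 * 0.05) ≈ 0.218, so a label of 0 has a z-score of about -4.36 and fails the |z| < 3 test. A possible alternative, sketched here under the assumption that the label column is named "label" and every other column is numeric, is to compute the z-scores on the feature columns only. The helper name drop_feature_outliers and its parameters are hypothetical, not part of pandas or SciPy.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_feature_outliers(df, label_col="label", threshold=3.0):
    """Drop rows with any feature |z-score| >= threshold.

    Hypothetical helper: the label column is excluded from the
    z-score computation, so class membership never affects the filter.
    """
    # Assumption: all columns other than the label are numeric.
    features = df.drop(columns=[label_col]).to_numpy(dtype=float)

    # Column-wise z-scores (population standard deviation, SciPy's default).
    z = np.abs(stats.zscore(features, axis=0))

    # Keep a row only if every feature is within the threshold.
    # A constant column yields NaN z-scores, which fail this comparison.
    mask = (z < threshold).all(axis=1)
    return df[mask]


# Usage (assuming df already exists):
# df = drop_feature_outliers(df, label_col="label")
```

Unlike the one-liner above, this may still drop label-0 rows, but only when one of their feature values is itself an outlier.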

Bill Huang