1

I am working on a binary classification problem, and I am struggling with removing outliers and increasing accuracy.

`ratings` is one of my features, and it looks like this:


0        0.027465
1        0.027465
2        0.027465
3        0.027465
4        0.027465
           ...   
26043    0.027465
26044    0.027465
26045    0.102234
26046    0.027465
26047    0.027465

Mean value of the data:

train.ratings.mean()
0.03871552285960927 

Standard deviation of the data:

train.ratings.std()
0.07585168664836195

I tried a log transformation, but accuracy did not improve:

train['ratings']=np.log(train.ratings+1)

My goal is to classify the data as True or False:

train.netgain
0        False
1        False
2        False
3        False
4         True
         ...  
26043     True
26044    False
26045     True
26046    False
26047    False
marton mar suri
  • Is `ratings` a feature of your model, or is it the output score? I'm not sure whether you want an input feature without outliers, or an output score with a wider distribution. – Robert King Dec 23 '19 at 13:15
  • What is your goal? Do you just want to remove outliers from the ratings feature? If so, what is your criterion for an outlier? For example, you can assume that outliers are observations farther than 3 standard deviations from the mean, or observations with a value above a specific quantile. You need to be more specific. – treskov Dec 23 '19 at 13:23
  • Hi Robert, ratings is one of my features. It looks like it has an outlier, but you're saying there is no outlier. – marton mar suri Dec 23 '19 at 16:56

2 Answers

1

One method I have used is to calculate the median absolute deviation (MAD) and then tag each observation with a boolean outlier flag, which makes it easy to select all the outliers.

Sample of MAD calculation:

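import numpy as np

def mad(x):
    # Median absolute deviation: median of |x - median(x)|
    return np.median(np.abs(x - np.median(x)))

def mad_ratio(x):
    # Distance of each value from the median, in units of MAD
    mad_value = mad(x)
    if mad_value == 0:
        # MAD is 0 when more than half of the values equal the median
        return 0
    x_mad = np.abs(x - np.median(x)) / mad_value
    return x_mad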
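
The tagging step described above could look roughly like this. It is only a sketch: the function name `mad_outlier_mask`, the cutoff of 3.5, and the sample values are assumptions for illustration, not part of the original answer. When the MAD is zero, this version flags nothing instead of dividing by zero.

```python
import numpy as np

def mad_outlier_mask(values, threshold=3.5):
    """Return a boolean array that is True where a value is flagged as an outlier."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad_value = np.median(np.abs(values - median))
    if mad_value == 0:
        # More than half of the values equal the median, so MAD cannot scale anything.
        # Assumption for this sketch: flag nothing rather than divide by zero.
        return np.zeros(values.shape, dtype=bool)
    # Flag values whose distance from the median exceeds `threshold` MADs
    return np.abs(values - median) / mad_value > threshold

# Hypothetical sample data, not the asker's column
sample = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 9.0])
mask = mad_outlier_mask(sample)   # boolean tag per observation
cleaned = sample[~mask]           # keep only the untagged values
```

Note that a column where most values are identical has a MAD of zero, so the ratio cannot separate anything there and a different criterion would be needed.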

TZof
0
  • Assume that the ratings feature is normally distributed and standardize it, i.e. convert each value to a z-score.

  • For a normal distribution, about 99.7% of values lie within 3 standard deviations of the mean, so we can remove the values that are more than 3 standard deviations away from the mean.
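
The 99.7% figure can be checked with the standard library alone. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the helper name `normal_coverage` below is only illustrative.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
# prints:
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These percentages hold only if the feature really is close to normal; for a heavily skewed column the fraction of rows removed by a 3-standard-deviation rule can be quite different.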


See the Python code below.

ratings_mean=train['ratings'].mean()  #Finding the mean of ratings column

ratings_std=train['ratings'].std()     # standard deviation of the column

train['ratings'] = train['ratings'].map(lambda x: (x - ratings_mean) / ratings_std)  # convert each value to a z-score

We have now standardized the column, so its mean should be 0 and its standard deviation should be 1. From this, we can find the values that are greater than 3 or less than -3 and remove those rows from the dataset.

train = train[np.abs(train['ratings']) < 3]  # keep rows within 3 standard deviations of the mean

The train DataFrame now has the outlier rows removed.

**Note: You can use 2 standard deviations instead, because about 95% of normally distributed data lies within 2 standard deviations of the mean. It all depends on domain knowledge and your data.**
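
To make the cutoff easy to change, the filter can be wrapped in a small function. This is a sketch under assumptions: `drop_zscore_outliers` and the `demo` data are hypothetical names and values, and unlike the code above it computes the z-scores separately, so the ratings column keeps its original scale.

```python
import pandas as pd

def drop_zscore_outliers(df, column, threshold=3.0):
    """Return the rows of df whose z-score in `column` is below `threshold` in absolute value."""
    # z-score: distance from the column mean in units of the sample standard deviation
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() < threshold]

# Hypothetical data: 50 typical values plus one extreme value
demo = pd.DataFrame({"ratings": [0.03] * 50 + [5.0]})
kept_3 = drop_zscore_outliers(demo, "ratings", threshold=3.0)  # 3-standard-deviation rule
kept_2 = drop_zscore_outliers(demo, "ratings", threshold=2.0)  # stricter 2-standard-deviation rule
```

On this toy data both thresholds drop only the extreme row. On real data a lower threshold removes more rows, so it is worth checking how many observations are lost before retraining.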

Ravi