
I am new to data science and am here to clarify some doubts. I have an imbalanced dataset with three classes, labeled 1, 2, and 3: class 2 makes up the majority (56.89%), class 1 is 9.6%, and class 3 is 33.4%. May I know the correct procedure for handling imbalanced datasets, hoping to end up with higher prediction accuracy?

Right now, this is what I am doing:

1) Split the dataset 70:30 (train/test).

2) Use SMOTE on the training set to make it balanced.

3) Use feature selection to find the most important features and re-transform them into a new training set for testing, but this step fails with an error.

My Jupyter notebook raises an error after the third step: MemoryError: could not allocate 14680064 bytes. May I know why as well? Thank you so much; any advice or help is appreciated!
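For reference, here is a minimal sketch of the pipeline I am describing, assuming the imbalanced-learn package; the file name and target column are just placeholders for my actual data:

import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv('earthquake_damage.csv')   # placeholder file name
X = df.drop('damage_grade', axis=1)          # placeholder target column
y = df['damage_grade']

# 1) 70:30 train/test split, stratified so the test set keeps the class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 2) SMOTE on the training set only; the test set is left untouched
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

print(pd.Series(y_train_res).value_counts())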

  • What are the current PRF scores? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html – Yash Kumar Atri Apr 07 '20 at 12:42

2 Answers

1

Please don't use accuracy as the metric in a multi-class, imbalanced problem like this.

The right solution depends on what you really want: is the minority class just as important to you as the majority ones?

About the handling: one thing you can do is balance the dataset at training time by reducing the sample space of the majority classes down to the size of the minority class; if that leaves too few data points, then maybe you can build a two-level classifier instead. About creating artificial data points (SMOTE): it might work and it might not; it depends on the problem, so state your problem. Compute and share the PRFS (precision, recall, F-score, support) for a better understanding of what you really want to achieve.
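For example, a minimal sketch of computing PRFS with sklearn, assuming you already have the test labels y_test and your model's predictions y_pred:

from sklearn.metrics import precision_recall_fscore_support, classification_report

# Per-class precision, recall, F-score and support on the held-out test set
p, r, f, support = precision_recall_fscore_support(y_test, y_pred, labels=[1, 2, 3])
print(classification_report(y_test, y_pred, digits=3))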

About the memory error: some variable is asking for more memory than your system can provide (the system reserves some extra space, and you are going well beyond it), or you are hitting the factor we all love to face in data science, "The Dimensionality Curse". Keep in mind that SMOTE multiplies the number of rows in an already wide training set, so the resampled frame can easily outgrow your RAM.
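As a first check (just a sketch; df_train stands in for whatever frame you end up feeding to feature selection), you can measure how much memory the frame actually takes and downcast the 64-bit columns before running the selection step:

import numpy as np
import pandas as pd

# Rough memory check of the resampled training frame
print(df_train.memory_usage(deep=True).sum() / 1e6, "MB")

# Downcast 64-bit columns to shrink the frame before feature selection
for col in df_train.select_dtypes(include=['float64']).columns:
    df_train[col] = df_train[col].astype(np.float32)
for col in df_train.select_dtypes(include=['int64']).columns:
    df_train[col] = pd.to_numeric(df_train[col], downcast='integer')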

Yash Kumar Atri
  • Hi Sir, thanks for answering my question. I am actually predicting damage caused by earthquakes: Low (1), Med (2), High (3). I applied SMOTE after splitting the data. Before SMOTE: Counter({2: 148259, 3: 87218, 1: 25124}); after SMOTE: Counter({3: 148259, 2: 148259, 1: 148259}). Does reducing the sample space mean undersampling the majority classes (3) and (2) to make them equivalent to (1)? And pardon my understanding, but what is PRFS? Thank you so much. – Kelvin Wong Wei Liang Apr 07 '20 at 12:54
  • Yeah, undersample them. PRFS is precision, recall, F-score and support; check the earlier comment with the link to sklearn's precision_recall_fscore_support. Please report the PRF and then determine what you want to improve. – Yash Kumar Atri Apr 07 '20 at 12:55
  • Hi Sir, the PRF without sampling (after a split, using a random forest classifier) is: Low (1): precision 0.1, recall 0.46, f1-score 0.56, support 7601; Med (2): precision 0.73, recall 0.86, f1-score 0.79, support 44414; High (3): precision 0.76, recall 0.62, f1-score 0.68, support 26166. Hoping to increase the accuracy of the prediction. Thank you so much – Kelvin Wong Wei Liang Apr 07 '20 at 13:16
0

Here is a generic example for you to consider.

import pandas as pd
import numpy as np

# Read dataset
df = pd.read_csv('balance-scale.data', 
                 names=['balance', 'var1', 'var2', 'var3', 'var4'])

# Display example observations
df.head()

df['balance'].value_counts()
# R    288
# L    288
# B     49
# Name: balance, dtype: int64

# Transform into binary classification
df['balance'] = [1 if b=='B' else 0 for b in df.balance]

df['balance'].value_counts()
# 0    576
# 1     49
# Name: balance, dtype: int64
# About 8% were balanced

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Next, we'll fit a very simple model using default settings for everything.
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
clf_0 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_0 = clf_0.predict(X)

# How's the accuracy?
print( accuracy_score(y, pred_y_0) )
# 0.9216

# So our model has 92% overall accuracy, but is it because it's predicting only 1 class?
# Should we be excited?
print( np.unique( pred_y_0 ) )
# [0]

# at this point, we need to use RESAMPLING!
from sklearn.utils import resample
# Separate majority and minority classes
# upsample the minority class
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.balance.value_counts()
# 1    576
# 0    576
# Name: balance, dtype: int64

# Separate input features (X) and target variable (y)
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)

# Train model
clf_1 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_1 = clf_1.predict(X)

# Is our model still predicting just one class?
print( np.unique( pred_y_1 ) )
# [0 1]

# How's our accuracy?
print( accuracy_score(y, pred_y_1) )
# 0.513888888889

# Great, now the model is no longer predicting just one class. While the accuracy
# also took a nosedive, it's now more meaningful as a performance metric.

# now we need to downsample the majority class
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Display new class counts
df_downsampled.balance.value_counts()
# 1    49
# 0    49
# Name: balance, dtype: int64

# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)

# Train model
clf_2 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_2 = clf_2.predict(X)

# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )
# [0 1]

# How's our accuracy?
print( accuracy_score(y, pred_y_2) )
# 0.581632653061

Always remember that the Random Forest algorithm can handle imbalanced data sets quite well, so maybe this is all you need! I typically start every experiment with Random Forest; if it produces the results I am after, I'm done, and there is no need to hunt and peck for the best algorithm in the universe. You can also easily automate the process of testing dozens of algorithms on any given data set, as sketched after the code below.

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
from sklearn.ensemble import RandomForestClassifier

clf_4 = RandomForestClassifier()
clf_4.fit(X, y)

# Predict on training set
pred_y_4 = clf_4.predict(X)

# Is our model still predicting just one class?
print( np.unique( pred_y_4 ) )
# [0 1]

# How's our accuracy?
print( accuracy_score(y, pred_y_4) )
# 0.9744

# What about AUROC?
from sklearn.metrics import roc_auc_score

prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y, prob_y_4) )
# 0.999078798186
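If you want to automate trying several algorithms, here is a rough sketch using cross-validation and macro-averaged F1, which treats every class equally; the model list and parameters are only illustrative, and X and y are as defined above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    'logreg': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'tree':   DecisionTreeClassifier(class_weight='balanced'),
    'rf':     RandomForestClassifier(n_estimators=200, class_weight='balanced'),
}

# Compare models with 5-fold cross-validation and macro-averaged F1
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
    print(name, round(scores.mean(), 3))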

Reference:

https://elitedatascience.com/imbalanced-classes

ASH
  • Hi sir, thanks for answering my question. So if I use the Random Forest algorithm, will it automatically help me with balancing the dataset without my doing the upsampling/downsampling/SMOTE? I am trying to predict building damage caused by earthquakes (Low, Med, High). Thank you so much – Kelvin Wong Wei Liang Apr 07 '20 at 13:43
  • Well, I would surmise that it should work. Try it and see! You have nothing to lose! – ASH Apr 07 '20 at 13:47