There are a number of options when dealing with imbalanced data.
1. You could use a weighting mechanism, whereby errors on the minor class are penalised more heavily.
From my own experience, SVMs (support vector machines) and XGBoost models allow weights to be adjusted so that errors on the minor class are penalised more heavily.
For instance, if generating classification predictions using an SVM, then class_weight can be set to 'balanced' as below. This weights each class inversely proportionally to its frequency, so that errors on the minor class are penalised more heavily:
from sklearn import svm
# 'balanced' weights each class inversely to its frequency
model = svm.SVC(gamma='scale',
                class_weight='balanced')
model.fit(x1_train, y1_train)
predictions = model.predict(x1_val)
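To see the weights that 'balanced' implies, they can be computed directly with scikit-learn's compute_class_weight utility. A minimal sketch, assuming y1_train is an array of class labels:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
# each class weight = n_samples / (n_classes * count of that class)
classes = np.unique(y1_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y1_train)
print(dict(zip(classes, weights)))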
For XGBoost, the scale_pos_weight parameter can be set to an appropriate value so as to penalise errors on the minor class more heavily. The higher the value, the greater the weight assigned to the positive class - which assumes the minor class is coded as 1.
import xgboost as xgb
# scale_pos_weight > 1 increases the penalty on positive-class errors
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=3)
xgb_model.fit(x1_train, y1_train)
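Rather than picking the value by hand, a common starting point (suggested in the XGBoost documentation) is the ratio of negative to positive examples. A sketch, assuming y1_train is a NumPy array of 0/1 labels:
import numpy as np
# heuristic: sum(negative instances) / sum(positive instances)
ratio = np.sum(y1_train == 0) / np.sum(y1_train == 1)
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=ratio)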
2. You could oversample the minor class. A technique such as SMOTE from the imblearn library can be used:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
x1_train, y1_train = oversample.fit_resample(x1_train, y1_train)
This technique generates synthetic samples by interpolating between existing observations of the minor class, so that the number of observations in each class is equal.
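To confirm that the resampled classes are balanced, the label counts can be inspected (the counts in the comment are illustrative):
from collections import Counter
# after SMOTE both classes should have the same count, e.g. Counter({0: 900, 1: 900})
print(Counter(y1_train))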
As for which technique to use - my recommendation would be to assess which one performs best when comparing the predictions to the test data, as in the sketch below. However, I would add a caveat: accuracy readings should be analysed with scepticism.
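A per-class report makes this comparison concrete (assuming each candidate model's predictions are scored against the same validation labels y1_val):
from sklearn.metrics import classification_report
# reports precision, recall and F1 for each class separately,
# which is more informative than overall accuracy on imbalanced data
print(classification_report(y1_val, predictions))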
Accuracy vs. Precision vs. Recall
Let's take this example. We build a model on a dataset where the major class accounts for 90% of observations and the minor class for 10%. The model shows 90% accuracy when predicting against a test set.
However, there is a problem. The model fails to correctly classify any of the minor class observations in the test set - a model that simply predicted the major class every time would score exactly the same. Thus, the model does very well at predicting the major class but very poorly at predicting the minor class.
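This baseline is easy to reproduce with scikit-learn's DummyClassifier, which can be configured to always predict the most frequent class. A sketch on the hypothetical 90/10 split above:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
# always predicts the majority class, ignoring the features entirely
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x1_train, y1_train)
baseline_preds = baseline.predict(x1_val)
print(accuracy_score(y1_val, baseline_preds))  # ~0.90 on a 90/10 split
print(recall_score(y1_val, baseline_preds))    # 0.0 - no minor class observations caught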
In this regard, you should also note the readings of precision (the share of predicted positives that are truly positive, which penalises false positives) and recall (the share of actual positives that the model identifies, which penalises false negatives). As an example, let us say a company wants to predict customers that cancel their subscription to a product (1 = cancel, 0 = do not cancel). 90% of customers do not cancel, but 10% do.
In this instance - because a missed cancellation is a false negative we want to minimise - we are looking for a high recall score. In this regard, a model with 60% overall accuracy but 90% recall would be preferable to a model with 90% accuracy but only 10% recall.
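Both metrics can be computed directly for the cancellation example (assuming binary predictions scored against y1_val, with 1 marking a cancellation):
from sklearn.metrics import precision_score, recall_score
# pos_label=1 treats 'cancel' as the positive class
print(precision_score(y1_val, predictions, pos_label=1))
print(recall_score(y1_val, predictions, pos_label=1))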