The issue:
In your code, the class labels are getting messed up because of the way the OneVsOneClassifier works internally. It converts the original multi-class problem into multiple binary classification problems. For each of these binary problems, the classes are relabeled as 0 and 1, which is why you see only 0 and 1 in your output.
The issue, detailed:
When you use OneVsOneClassifier, it internally constructs multiple binary classifiers, each trained on only two of the original classes. For each of these binary classifiers, the class labels are transformed into 0 and 1. This transformation is done internally by OneVsOneClassifier to handle the binary classification problem.
Now, inside your DataUndersampler class, the labels y that you receive are these transformed labels 0 and 1, not the original labels from your multi-class problem. This is why your print statements inside DataUndersampler.fit_resample() show Counter objects with keys 0 and 1.
Here is an example to illustrate how this happens:
Suppose you have a multi-class problem with 3 classes, labeled 0, 1, and 2. When OneVsOneClassifier is applied, it will create 3 binary classifiers: one for class 0 vs class 1, one for class 0 vs class 2, and one for class 1 vs class 2.
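Just to make that bookkeeping concrete, the pairs can be enumerated like this (an illustrative snippet of mine, not part of your code):

from itertools import combinations

# OneVsOneClassifier trains one binary classifier per unordered pair of classes
print(list(combinations([0, 1, 2], 2)))   # [(0, 1), (0, 2), (1, 2)] -> 3 binary problems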
Now, for each of these binary classifiers, the classes are relabeled as 0 and 1. That means, for the first classifier (class 0 vs class 1), the original class 0 might be relabeled as 0 and the original class 1 as 1. But for the second classifier (class 0 vs class 2), the original class 0 might be relabeled as 0 and the original class 2 as 1. Similarly, for the third classifier (class 1 vs class 2), the original class 1 might be relabeled as 0 and the original class 2 as 1.
When your DataUndersampler.fit_resample() method receives y, it receives these transformed labels, not the original labels from your multi-class problem.
The key point is that the relabeling to 0 and 1 is done independently for each binary classifier and does not preserve the original labels. This is why you see only 0 and 1 in your output, and this is what I mean by "the class labels are getting messed up". It is not that the labels are being incorrectly assigned; rather, the original labels are being transformed into 0 and 1 for each binary classification problem, which is not what you were expecting.
In order to keep track of the original labels, you would need to store them before the transformation and then map the binary labels back to the original labels after you have done the resampling.
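To see exactly this relabeling in action, here is a small self-contained sketch of mine (not part of your code): a dummy estimator, named LabelInspector only for this example, that simply prints the labels each inner OvO estimator receives.

from collections import Counter
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsOneClassifier

class LabelInspector(BaseEstimator, ClassifierMixin):
    """Dummy classifier that just reports the labels it is given."""
    def fit(self, X, y):
        print("labels seen by inner estimator:", Counter(y))
        self.classes_ = np.unique(y)
        return self
    def predict(self, X):
        return np.full(len(X), self.classes_[0])

rng = np.random.default_rng(0)
X = rng.random((60, 2))
y = np.repeat([3, 5, 7], 20)          # original labels are 3, 5 and 7
OneVsOneClassifier(LabelInspector()).fit(X, y)
# each of the three inner fits reports only the keys 0 and 1 (20 samples each),
# even though the original labels were 3, 5 and 7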
Possible solution:
To address this issue, you can instead use the scikit-learn-contrib/imbalanced-learn library (pip install -U imbalanced-learn).
Its RandomUnderSampler handles the relabeling issue internally and ensures that the original class labels are preserved.
In the original implementation, the class labels were getting "messed up" because the OneVsOneClassifier was converting the multi-class problem into multiple binary classification problems. For each binary problem, the classes were being relabeled as 0 and 1. This is why you were seeing only 0 and 1 in your output, even if your original data had different labels.
With the RandomUnderSampler, the class labels are preserved. The RandomUnderSampler works by randomly selecting a subset of the majority class to create a new balanced dataset. The class labels from the original dataset are used in this new dataset.
So, in the new implementation, there is no need to maintain a mapping from the original class labels to the binary labels, because the RandomUnderSampler handles this for you. This is one of the benefits of using specialized libraries like imbalanced-learn, which provide robust solutions to common issues in machine learning.
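As a quick sanity check (a toy example of my own, not from your code), you can verify that RandomUnderSampler only changes the class counts and never the label values themselves:

from collections import Counter
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(-1, 1)               # 20 dummy samples, one feature
y = np.array([3] * 2 + [7] * 4 + [9] * 14)     # imbalanced labels 3, 7 and 9

X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y))      # counts: {9: 14, 7: 4, 3: 2}
print(Counter(y_res))  # still labelled 3, 7 and 9, now 2 samples each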
Here is a modified version of your DataUndersampler class that keeps track of the original labels, and how it is used:
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification


class DataUndersampler:
    def __init__(self):
        self.sampler = RandomUnderSampler(random_state=42)

    def fit(self, X, y):
        # the actual resampling happens in transform; fit just returns self
        self.sampler.fit_resample(X, y)
        return self

    def transform(self, X, y):
        X_res, y_res = self.sampler.fit_resample(X, y)
        return X_res, y_res


# Create a dummy imbalanced dataset
# (n_informative=3 so that n_classes * n_clusters_per_class <= 2**n_informative holds)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=3,
                           n_redundant=10, n_classes=3,
                           weights=[0.01, 0.01, 0.98], class_sep=0.8,
                           random_state=42)

# initialize your undersampler
undersampler = DataUndersampler()

# fit the undersampler and transform the data
X_resampled, y_resampled = undersampler.fit(X, y).transform(X, y)

print(f"Original class distribution: {Counter(y)}")
print(f"Resampled class distribution: {Counter(y_resampled)}")

# initialize the pipeline (without the undersampler)
pipeline = Pipeline([
    ('clf', OneVsOneClassifier(RandomForestClassifier(random_state=42)))
])

# fit the pipeline on the resampled data
pipeline.fit(X_resampled, y_resampled)

# now you can use your pipeline to predict
# y_pred = pipeline.predict(X_test)  # assuming you have a test set X_test
I have commented out the last line since there is no X_test defined in this code. If you have a separate test set, you can uncomment that line to make predictions.
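For completeness, one way to obtain such a test set (my own addition, continuing the snippet above; the train_test_split call and the 80/20 split are illustrative choices, not part of your original code) is:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# resample only the training portion, then evaluate on the untouched test set
X_res, y_res = undersampler.fit(X_train, y_train).transform(X_train, y_train)
pipeline.fit(X_res, y_res)
y_pred = pipeline.predict(X_test)
print(f"Test set class distribution: {Counter(y_test)}")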
The main changes are as follows:
- RandomUnderSampler is used instead of manually implementing the undersampling. This eliminates the need for the _undersample function and significantly simplifies the fit and transform methods.
- The fit method now just fits the RandomUnderSampler to the data and returns self. This is because the fit method of a transformer in a scikit-learn pipeline is expected to return self.
- The transform method applies the fitted RandomUnderSampler to the data and returns the undersampled data.
The main idea behind these changes is to leverage existing libraries and conventions as much as possible to make the code simpler, easier to understand, and more maintainable.
MWE
The minimal working example (MWE) would now be:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers as steps
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# synthetic data: weights are per-class proportions; the fifth class gets the remainder
X, y = make_classification(n_samples=10_000, n_classes=5,
                           weights=[0.226, 0.037, 0.164, 0.519],
                           n_informative=4)
print("Original class distribution:", Counter(y))

resampler = RandomUnderSampler(random_state=234)
rf_clf = RandomForestClassifier()
pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)

classifier.fit(X, y)

# predict and evaluate
y_pred = classifier.predict(X)
print("Predicted class distribution:", Counter(y_pred))
In this updated code:
- We import RandomUnderSampler from imbalanced-learn.
- We replace the DataUndersampler with RandomUnderSampler in the pipeline.
- We use imblearn's Pipeline instead of scikit-learn's, because only the former accepts samplers (objects exposing fit_resample) as intermediate steps.
- We remove the print statements related to the resampled class distribution, as the RandomUnderSampler does not provide this information directly. However, you can still get the distribution of the predicted classes after training the classifier.
This code should work without the label issue you were experiencing before. Also, it should be shorter and more concise than the original MWE.
We want to fit an SVC to determine the support vectors in each pair of classes, then ignore examples of the majority class farther away from its support vectors until we achieve data balance (n_majority = n_minority examples).
Support vector-based undersampling
So your aim is to undersample the majority class in a more informed way, taking the structure of the data into account rather than discarding examples at random.
We need to revise the DataUndersampler to perform this strategy.
The main idea is to fit an SVC (C-Support Vector Classification) to the data, find the support vectors, and then undersample the majority class based on the distances to these support vectors.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVC
import numpy as np
import pandas as pd


class DataUndersampler(BaseEstimator, TransformerMixin):
    """Support vector-based undersampler; expects X as a pandas DataFrame and y as a Series."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        self.svc = SVC(kernel='linear')

    def fit(self, X, y):
        # Fit the SVC to the data to locate the support vectors
        self.svc.fit(X, y)
        return self

    def transform(self, X, y):
        # Get the support vectors found during fit
        support_vectors = self.svc.support_vectors_

        # Separate majority and minority classes
        majority_class = y.value_counts().idxmax()
        minority_class = y.value_counts().idxmin()
        X_majority = X[y == majority_class]
        y_majority = y[y == majority_class]
        X_minority = X[y == minority_class]
        y_minority = y[y == minority_class]

        # Distance of each majority-class sample to its nearest support vector
        distances = np.min(
            np.linalg.norm(X_majority.values[:, np.newaxis] - support_vectors, axis=2),
            axis=1
        )

        # Sort the majority-class samples by distance and keep only as many as there
        # are minority-class samples (those closest to the decision boundary)
        sorted_indices = np.argsort(distances)
        indices_to_keep = sorted_indices[:len(y_minority)]

        # Combine the undersampled majority class with the minority class
        X_resampled = pd.concat([X_majority.iloc[indices_to_keep], X_minority])
        y_resampled = pd.concat([y_majority.iloc[indices_to_keep], y_minority])

        return X_resampled, y_resampled
You can use this transformer in your pipeline like before:
resampler = DataUndersampler(random_state=234)
pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X, y)
This approach will respect the data structure when undersampling, since it uses the support vectors of an SVM to guide the undersampling process. It should also resolve the issue of incorrect labels.
However, please note that this will be more computationally expensive than random undersampling due to the need to fit an SVM and calculate distances to support vectors for each pair of classes.
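If you prefer this support-vector strategy to plug into the pipeline exactly the way RandomUnderSampler does (i.e. through fit_resample rather than a transform(X, y) call), one possible sketch is to wrap the logic in imblearn's FunctionSampler. This is my own variation, not your original code; it assumes plain NumPy inputs and exactly the two classes that each OneVsOneClassifier sub-problem passes down.

import numpy as np
from collections import Counter
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier

def sv_undersample(X, y):
    """Keep the majority-class samples closest to the SVC decision boundary."""
    counts = Counter(y)
    if len(set(counts.values())) == 1:
        return X, y  # classes already balanced, nothing to do
    maj = max(counts, key=counts.get)
    mino = min(counts, key=counts.get)
    svc = SVC(kernel='linear').fit(X, y)
    X_maj, y_maj = X[y == maj], y[y == maj]
    # distance of each majority sample to its nearest support vector
    dists = np.min(np.linalg.norm(X_maj[:, None, :] - svc.support_vectors_, axis=2), axis=1)
    keep = np.argsort(dists)[:counts[mino]]  # keep those closest to the boundary
    X_res = np.vstack([X_maj[keep], X[y == mino]])
    y_res = np.concatenate([y_maj[keep], y[y == mino]])
    return X_res, y_res

pipeline = Pipeline([
    ('sampler', FunctionSampler(func=sv_undersample)),
    ('clf', RandomForestClassifier()),
])
classifier = OneVsOneClassifier(estimator=pipeline)
# classifier.fit(X, y)  # X, y as in the MWE above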
The new DataUndersampler class is quite different from the original one, as it uses a different undersampling strategy.
Here are the main differences:
- Support Vector Classifier (SVC): The new class fits an SVC to the data in the fit method. This is a major difference, as the original class did not use any learning algorithm. The SVC is used to find the support vectors, which are the data points that define the decision boundary between classes.
- Support vectors and distances: The new class uses the support vectors to calculate the distance from each data point in the majority class to its nearest support vector. This information is used to undersample the majority class, keeping the data points that are closest to the support vectors. In contrast, the original class used a random undersampling strategy, which does not take into account the structure of the data.
- Resampling: The new class undersamples the majority class based on the calculated distances, keeping as many data points as there are in the minority class. This ensures that the classes are balanced, but also that the majority class data points that are kept are those that are most informative, as they are close to the decision boundary. The original class also aimed to balance the classes, but it did so by randomly discarding data points from the majority class.
- No more relabeling: The new class does not need to relabel the classes to 0 and 1, which was causing problems in the original code. The classes are kept as they are, as the SVC can handle the original labels (see the small illustration right after this list).
- Pandas: The new code makes use of pandas for data manipulation (e.g., separating the majority and minority classes, resampling the data). The original class used numpy arrays.
- Scikit-learn compatibility: Like the original class, the new class extends the BaseEstimator and TransformerMixin classes from scikit-learn, so it can be used as part of a scikit-learn pipeline. The fit and transform methods are used to fit the SVC and undersample the data, respectively.
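As a quick toy illustration of the "no more relabeling" point (my own example, not from your code), an SVC trained on arbitrary label values predicts those same values back:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [0.1], [2.9], [3.0]])
y = np.array([3, 3, 7, 7])                      # original labels 3 and 7, no relabeling
print(SVC(kernel='linear').fit(X, y).predict([[0.05], [2.95]]))  # -> [3 7]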
The new undersampling strategy used in the revised DataUndersampler class is essentially a method known as support vector-based undersampling.
In this strategy, the core idea is to fit a Support Vector Machine (SVM) classifier to the data, which identifies the data points, called support vectors, that define the decision boundary between the classes.
Then, for each data point in the majority class, the distance to the nearest support vector is calculated. The rationale here is that the data points from the majority class that are closest to the decision boundary (i.e., the support vectors) are the most informative for the classification task, as they are on the 'edge' of the majority class and closest to the minority class.
The data points in the majority class are then ranked according to this distance, and the ones that are farthest from the decision boundary are discarded, until the number of data points in the majority class is equal to the number of data points in the minority class. This effectively undersamples the majority class, while preserving its most informative data points.
This strategy is different from the original one in the DataUndersampler class, which simply randomly discards data points from the majority class until the classes are balanced. The support vector-based undersampling strategy is a more sophisticated and targeted approach, as it considers the structure of the data when deciding which data points to discard.