The issue:
In your code, the class labels are getting messed up because of the way the OneVsOneClassifier works internally. It converts the original multi-class problem into multiple binary classification problems. For each of these binary problems, the classes are relabeled as 0 and 1, which is why you see only 0 and 1 in your output.
The issue, detailed:
When you use OneVsOneClassifier, it internally constructs multiple binary classifiers, each trained on only two of the original classes. For each of these binary classifiers, the class labels are transformed into 0 and 1. This transformation is done internally by OneVsOneClassifier to handle the binary classification problem.
Now, inside your DataUndersampler class, the labels y that you receive are these transformed labels 0 and 1, not the original labels from your multi-class problem. This is why your print statements inside DataUndersampler.fit_resample() show Counter objects with keys 0 and 1.
Here is an example to illustrate how this happens:
Suppose you have a multi-class problem with 3 classes, labeled 0, 1, and 2. When OneVsOneClassifier is applied, it will create 3 binary classifiers: one for class 0 vs class 1, one for class 0 vs class 2, and one for class 1 vs class 2.
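Just to make that bookkeeping concrete, the pairs can be enumerated like this (an illustrative snippet of mine, not part of your code):

from itertools import combinations

# OneVsOneClassifier trains one binary classifier per unordered pair of classes
print(list(combinations([0, 1, 2], 2)))   # [(0, 1), (0, 2), (1, 2)] -> 3 binary problems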
Now, for each of these binary classifiers, the classes are relabeled as 0 and 1. That means, for the first classifier (class 0 vs class 1), the original class 0 might be relabeled as 0 and the original class 1 as 1. But for the second classifier (class 0 vs class 2), the original class 0 might be relabeled as 0 and the original class 2 as 1. Similarly, for the third classifier (class 1 vs class 2), the original class 1 might be relabeled as 0 and the original class 2 as 1.
When your DataUndersampler.fit_resample() method receives y, it receives these transformed labels, not the original labels from your multi-class problem.
The key point is that the relabeling to 0 and 1 is done independently for each binary classifier and does not preserve the original labels. This is why you see only 0 and 1 in your output, and this is what I mean by "the class labels are getting messed up". It is not that the labels are being incorrectly assigned; rather, the original labels are being transformed into 0 and 1 for each binary classification problem, which is not what you were expecting.
In order to keep track of the original labels, you would need to store them before the transformation and then map the binary labels back to the original labels after you have done the resampling.
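To see exactly this relabeling in action, here is a small self-contained sketch of mine (not part of your code): a dummy estimator, named LabelInspector only for this example, that simply prints the labels each inner OvO estimator receives.

from collections import Counter
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsOneClassifier

class LabelInspector(BaseEstimator, ClassifierMixin):
    """Dummy classifier that just reports the labels it is given."""
    def fit(self, X, y):
        print("labels seen by inner estimator:", Counter(y))
        self.classes_ = np.unique(y)
        return self
    def predict(self, X):
        return np.full(len(X), self.classes_[0])

rng = np.random.default_rng(0)
X = rng.random((60, 2))
y = np.repeat([3, 5, 7], 20)          # original labels are 3, 5 and 7
OneVsOneClassifier(LabelInspector()).fit(X, y)
# each of the three inner fits reports only the keys 0 and 1 (20 samples each),
# even though the original labels were 3, 5 and 7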
Possible solution:
To address this issue, you can instead use the scikit-learn-contrib/imbalanced-learn library (pip install -U imbalanced-learn).
Its RandomUnderSampler handles the relabeling issue internally and ensures that the original class labels are preserved.
In the original implementation, the class labels were getting "messed up" because the OneVsOneClassifier was converting the multi-class problem into multiple binary classification problems. For each binary problem, the classes were being relabeled as 0 and 1. This is why you were seeing only 0 and 1 in your output, even if your original data had different labels.
With the RandomUnderSampler, the class labels are preserved. The RandomUnderSampler works by randomly selecting a subset of the majority class to create a new balanced dataset. The class labels from the original dataset are used in this new dataset.
So, in the new implementation, there is no need to maintain a mapping from the original class labels to the binary labels, because the RandomUnderSampler handles this for you. This is one of the benefits of using specialized libraries like imbalanced-learn, which provide robust solutions to common issues in machine learning.
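As a quick sanity check (a toy example of my own, not from your code), you can verify that RandomUnderSampler only changes the class counts and never the label values themselves:

from collections import Counter
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(-1, 1)               # 20 dummy samples, one feature
y = np.array([3] * 2 + [7] * 4 + [9] * 14)     # imbalanced labels 3, 7 and 9

X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y))      # counts: {9: 14, 7: 4, 3: 2}
print(Counter(y_res))  # still labelled 3, 7 and 9, now 2 samples each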
Here is a modified version of your DataUndersampler class that keeps track of the original labels, and how it is used:
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_classification


class DataUndersampler:
    def __init__(self):
        self.sampler = RandomUnderSampler(random_state=42)

    def fit(self, X, y):
        # the actual resampling happens in transform; fit just returns self
        self.sampler.fit_resample(X, y)
        return self

    def transform(self, X, y):
        X_res, y_res = self.sampler.fit_resample(X, y)
        return X_res, y_res


# Create a dummy imbalanced dataset
# (n_informative=3 so that n_classes * n_clusters_per_class <= 2**n_informative holds)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=3,
                           n_redundant=10, n_classes=3,
                           weights=[0.01, 0.01, 0.98], class_sep=0.8,
                           random_state=42)

# initialize your undersampler
undersampler = DataUndersampler()

# fit the undersampler and transform the data
X_resampled, y_resampled = undersampler.fit(X, y).transform(X, y)

print(f"Original class distribution: {Counter(y)}")
print(f"Resampled class distribution: {Counter(y_resampled)}")

# initialize the pipeline (without the undersampler)
pipeline = Pipeline([
    ('clf', OneVsOneClassifier(RandomForestClassifier(random_state=42)))
])

# fit the pipeline on the resampled data
pipeline.fit(X_resampled, y_resampled)

# now you can use your pipeline to predict
# y_pred = pipeline.predict(X_test)  # assuming you have a test set X_test
I have commented out the last line since there is no X_test defined in this code. If you have a separate test set, you can uncomment that line to make predictions.
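For completeness, one way to obtain such a test set (my own addition, continuing the snippet above; the train_test_split call and the 80/20 split are illustrative choices, not part of your original code) is:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# resample only the training portion, then evaluate on the untouched test set
X_res, y_res = undersampler.fit(X_train, y_train).transform(X_train, y_train)
pipeline.fit(X_res, y_res)
y_pred = pipeline.predict(X_test)
print(f"Test set class distribution: {Counter(y_test)}")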
The main changes are as follows:
- RandomUnderSampler is used instead of manually implementing the undersampling. This eliminates the need for the _undersample function and significantly simplifies the fit and transform methods.
- The fit method now just fits the RandomUnderSampler to the data and returns self. This is because the fit method of a transformer in a scikit-learn pipeline is expected to return self.
- The transform method applies the fitted RandomUnderSampler to the data and returns the undersampled data.
The main idea behind these changes is to leverage existing libraries and conventions as much as possible to make the code simpler, easier to understand, and more maintainable.
MWE
The minimal working example (MWE) would now be:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts samplers as steps
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# synthetic data: weights are per-class proportions; the fifth class gets the remainder
X, y = make_classification(n_samples=10_000, n_classes=5,
                           weights=[0.226, 0.037, 0.164, 0.519],
                           n_informative=4)
print("Original class distribution:", Counter(y))

resampler = RandomUnderSampler(random_state=234)
rf_clf = RandomForestClassifier()
pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)

classifier.fit(X, y)

# predict and evaluate
y_pred = classifier.predict(X)
print("Predicted class distribution:", Counter(y_pred))
In this updated code:
- We import RandomUnderSampler from imbalanced-learn.
- We replace the DataUndersampler with RandomUnderSampler in the pipeline.
- We use imblearn's Pipeline instead of scikit-learn's, because only the former accepts samplers (objects exposing fit_resample) as intermediate steps.
- We remove the print statements related to the resampled class distribution, as the RandomUnderSampler does not provide this information directly. However, you can still get the distribution of the predicted classes after training the classifier.
This code should work without the label issue you were experiencing before. Also, it should be shorter and more concise than the original MWE.
We want to fit an SVC to determine the support vectors in each pair of classes, then ignore examples of the majority class farther away from its support vectors until we achieve data balance (n_majority = n_minority examples).
Support vector-based undersampling
So your aim is to undersample the majority class in a more informed way, taking the structure of the data into account rather than discarding examples at random.
We need to revise the DataUndersampler to perform this strategy.
The main idea is to fit an SVC (C-Support Vector Classification) to the data, find the support vectors, and then undersample the majority class based on the distances to these support vectors.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVC
import numpy as np
import pandas as pd


class DataUndersampler(BaseEstimator, TransformerMixin):
    """Support vector-based undersampler; expects X as a pandas DataFrame and y as a Series."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        self.svc = SVC(kernel='linear')

    def fit(self, X, y):
        # Fit the SVC to the data to locate the support vectors
        self.svc.fit(X, y)
        return self

    def transform(self, X, y):
        # Get the support vectors found during fit
        support_vectors = self.svc.support_vectors_

        # Separate majority and minority classes
        majority_class = y.value_counts().idxmax()
        minority_class = y.value_counts().idxmin()
        X_majority = X[y == majority_class]
        y_majority = y[y == majority_class]
        X_minority = X[y == minority_class]
        y_minority = y[y == minority_class]

        # Distance of each majority-class sample to its nearest support vector
        distances = np.min(
            np.linalg.norm(X_majority.values[:, np.newaxis] - support_vectors, axis=2),
            axis=1
        )

        # Sort the majority-class samples by distance and keep only as many as there
        # are minority-class samples (those closest to the decision boundary)
        sorted_indices = np.argsort(distances)
        indices_to_keep = sorted_indices[:len(y_minority)]

        # Combine the undersampled majority class with the minority class
        X_resampled = pd.concat([X_majority.iloc[indices_to_keep], X_minority])
        y_resampled = pd.concat([y_majority.iloc[indices_to_keep], y_minority])

        return X_resampled, y_resampled
You can use this transformer in your pipeline like before:
resampler = DataUndersampler(random_state=234)
pipeline = Pipeline([('sampler', resampler), ('clf', rf_clf)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X, y)
This approach will respect the data structure when undersampling, since it uses the support vectors of an SVM to guide the undersampling process. It should also resolve the issue of incorrect labels.
However, please note that this will be more computationally expensive than random undersampling due to the need to fit an SVM and calculate distances to support vectors for each pair of classes.
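If you prefer this support-vector strategy to plug into the pipeline exactly the way RandomUnderSampler does (i.e. through fit_resample rather than a transform(X, y) call), one possible sketch is to wrap the logic in imblearn's FunctionSampler. This is my own variation, not your original code; it assumes plain NumPy inputs and exactly the two classes that each OneVsOneClassifier sub-problem passes down.

import numpy as np
from collections import Counter
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsOneClassifier

def sv_undersample(X, y):
    """Keep the majority-class samples closest to the SVC decision boundary."""
    counts = Counter(y)
    if len(set(counts.values())) == 1:
        return X, y  # classes already balanced, nothing to do
    maj = max(counts, key=counts.get)
    mino = min(counts, key=counts.get)
    svc = SVC(kernel='linear').fit(X, y)
    X_maj, y_maj = X[y == maj], y[y == maj]
    # distance of each majority sample to its nearest support vector
    dists = np.min(np.linalg.norm(X_maj[:, None, :] - svc.support_vectors_, axis=2), axis=1)
    keep = np.argsort(dists)[:counts[mino]]  # keep those closest to the boundary
    X_res = np.vstack([X_maj[keep], X[y == mino]])
    y_res = np.concatenate([y_maj[keep], y[y == mino]])
    return X_res, y_res

pipeline = Pipeline([
    ('sampler', FunctionSampler(func=sv_undersample)),
    ('clf', RandomForestClassifier()),
])
classifier = OneVsOneClassifier(estimator=pipeline)
# classifier.fit(X, y)  # X, y as in the MWE above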
The new DataUndersampler class is quite different from the original one, as it uses a different undersampling strategy.
Here are the main differences:
- Support Vector Classifier (SVC): The new class fits an SVC to the data in the fit method. This is a major difference, as the original class did not use any learning algorithm. The SVC is used to find the support vectors, which are the data points that define the decision boundary between classes.
- Support vectors and distances: The new class uses the support vectors to calculate the distance from each data point in the majority class to its nearest support vector. This information is used to undersample the majority class, keeping the data points that are closest to the support vectors. In contrast, the original class used a random undersampling strategy, which does not take into account the structure of the data.
- Resampling: The new class undersamples the majority class based on the calculated distances, keeping as many data points as there are in the minority class. This ensures that the classes are balanced, but also that the majority class data points that are kept are those that are most informative, as they are close to the decision boundary. The original class also aimed to balance the classes, but it did so by randomly discarding data points from the majority class.
- No more relabeling: The new class does not need to relabel the classes to 0 and 1, which was causing problems in the original code. The classes are kept as they are, as the SVC can handle the original labels (see the small illustration right after this list).
- Pandas: The new code makes use of pandas for data manipulation (e.g., separating the majority and minority classes, resampling the data). The original class used numpy arrays.
- Scikit-learn compatibility: Like the original class, the new class extends the BaseEstimator and TransformerMixin classes from scikit-learn, so it can be used as part of a scikit-learn pipeline. The fit and transform methods are used to fit the SVC and undersample the data, respectively.
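As a quick toy illustration of the "no more relabeling" point (my own example, not from your code), an SVC trained on arbitrary label values predicts those same values back:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [0.1], [2.9], [3.0]])
y = np.array([3, 3, 7, 7])                      # original labels 3 and 7, no relabeling
print(SVC(kernel='linear').fit(X, y).predict([[0.05], [2.95]]))  # -> [3 7]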
The new undersampling strategy used in the revised DataUndersampler class is essentially a method known as support vector-based undersampling.
In this strategy, the core idea is to fit a Support Vector Machine (SVM) classifier to the data, which identifies the data points, called support vectors, that define the decision boundary between the classes.
Then, for each data point in the majority class, the distance to the nearest support vector is calculated. The rationale here is that the data points from the majority class that are closest to the decision boundary (i.e., the support vectors) are the most informative for the classification task, as they are on the 'edge' of the majority class and closest to the minority class.
The data points in the majority class are then ranked according to this distance, and the ones that are farthest from the decision boundary are discarded, until the number of data points in the majority class is equal to the number of data points in the minority class. This effectively undersamples the majority class, while preserving its most informative data points.
This strategy is different from the original one in the DataUndersampler class, which simply randomly discards data points from the majority class until the classes are balanced. The support vector-based undersampling strategy is a more sophisticated and targeted approach, as it considers the structure of the data when deciding which data points to discard.