I am using imbalanced-learn to oversample my data. I want to know how many entries in each class there are after using the oversampling method. This code works nicely:
from imblearn.over_sampling import SMOTE
from collections import Counter

def oversample(x_values, y_values):
    oversampler = SMOTE(random_state=42, n_jobs=-1)
    x_oversampled, y_oversampled = oversampler.fit_resample(x_values, y_values)
    # Report the class counts before and after resampling, plus the sampler's class name
    print("Oversampling training set from {0} to {1} using {2}".format(dict(Counter(y_values)), dict(Counter(y_oversampled)), type(oversampler).__name__))
    return x_oversampled, y_oversampled
But I switched to a pipeline so that I can use GridSearchCV to find the best oversampling method (out of ADASYN, SMOTE and BorderlineSMOTE). As a result, I never call fit_resample myself, so I lose that output when using something like this:
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier())])
pipe.fit(x_values, y_values)
The upsampling works, but I no longer see how many entries each class has in the training set.
Is there a way to get output similar to the first example when using a pipeline?