Say I have a binary imbalanced dataset like so:

from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot as plt
from imblearn.over_sampling import SMOTE

# fake dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)

print(counter)
Counter({0: 9900, 1: 100})

Using SMOTE to oversample the minority class:

oversample = SMOTE()
Xs, ys = oversample.fit_resample(X, y)

Now, to show a histogram of class distribution:

a. Before oversampling:

plt.hist(y)


b. After oversampling:

plt.hist(ys)


But in the oversampled plot, I would like to show the portion of the minority class that was generated by SMOTE in a different color.

Expected output:

Similar to the figure below:

enter image description here

  • @JohanC I do not understand what you mean. –  Mar 08 '22 at 20:40
  • Ah, now I see what you mean. I do not mean colour each class separately. Class `1` should have a stacked bar with 100 blue (original) at the bottom and 9800 orange on top, stacked together. –  Mar 08 '22 at 21:16

1 Answer

You can use plt.bar for a bar plot. If you draw two bar plots onto the same subplot, the second is painted over the first, so the first remains visible wherever its bars are taller than the second's.

import matplotlib.pyplot as plt
import numpy as np

# simulate before oversampling
y = np.random.choice([0, 1], 1000, p=[.95, .05])
# simulate after oversampling
ys = np.append(y, np.ones(sum(y == 0) - sum(y == 1), dtype=int))

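# draw the oversampled counts first; the class 1 bar is colored 'lime'
plt.bar([0, 1], height=[sum(ys == 0), sum(ys == 1)], color=['cornflowerblue', 'lime'])
# draw the original counts on top, so only the added part of class 1 stays 'lime'
plt.bar([0, 1], height=[sum(y == 0), sum(y == 1)], color='cornflowerblue')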
plt.xticks([0, 1])
plt.show()

plt.bar with counts
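
As an alternative to overlaying two bar plots, here is a minimal sketch that stacks the generated samples on top of the original ones via the `bottom` argument of `bar`. The helper names `split_counts` and `plot_stacked` are made up for illustration, and the sketch assumes the oversampler only adds samples and never removes any. The label lists below just mimic the counts from the question; in practice you would pass the real `y` and `ys`.

```python
from collections import Counter


def split_counts(y_before, y_after):
    """Per class, return the original count and the count added by oversampling.

    Assumes oversampling only adds samples (hypothetical helper, not part of imblearn).
    """
    before = Counter(y_before)
    after = Counter(y_after)
    classes = sorted(after)
    original = [before.get(c, 0) for c in classes]
    generated = [after[c] - before.get(c, 0) for c in classes]
    return classes, original, generated


def plot_stacked(classes, original, generated, ax=None):
    """Draw original counts at the bottom and generated counts stacked on top."""
    import matplotlib.pyplot as plt  # imported here so split_counts works without it

    if ax is None:
        ax = plt.gca()
    ax.bar(classes, original, color='cornflowerblue', label='original')
    # bottom=original makes the second set of bars start where the first ends
    ax.bar(classes, generated, bottom=original, color='orange', label='generated')
    ax.set_xticks(classes)
    ax.legend()
    return ax


# labels that mimic the class counts from the question
y = [0] * 9900 + [1] * 100
ys = y + [1] * 9800

classes, original, generated = split_counts(y, ys)
print(classes, original, generated)  # [0, 1] [9900, 100] [0, 9800]
```

Calling `plot_stacked(classes, original, generated)` followed by `plt.show()` should then give a class `1` bar with 100 blue at the bottom and 9800 orange on top.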

JohanC
  • Magic! This is exactly what I have been struggling to do. Many thanks. –  Mar 08 '22 at 21:19