MNIST dataset missing classes

Question

I'm training a ConvNet using Keras and Theano, but before doing that I decided to take a peek into the dataset, its data samples and classes... And I don't like what I'm seeing.

I'm using the following code to load both the training and test datasets and count how many samples carry each label:

import numpy as np
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

train_classes = [0,0,0,0,0,0,0,0,0,0]
test_classes = [0,0,0,0,0,0,0,0,0,0]

for i in y_train:
    train_classes[y_train[i]] = train_classes[y_train[i]] + 1

for i in y_test:
    test_classes[y_test[i]] = test_classes[y_test[i]] + 1

print('Training classes: ', train_classes)
print('\nTesting classes: ', test_classes)

... And the results are worrying:

(ann) C:\Users\shado\mnist>python statistics.py
Using Theano backend.
Training classes:  [6742, 17900, 5421, 6265, 11907, 5923, 0, 0, 0, 5842]

Testing classes:  [1010, 1924, 1135, 0, 1940, 974, 0, 980, 0, 2037]

So as you can see from the label counts, the training dataset is missing the '6', '7' and '8' classes, while the testing dataset is missing the '3', '6' and '8' classes. And of course, the class distribution is all over the place, especially in the training dataset.

Am I downloading the wrong dataset? Am I missing something here?

Julien · Accepted Answer · 2018-10-21T23:51:54.690

The logic you need is:

for i in y_train:
    train_classes[i] += 1

since i is already the label.

Or equivalently:

for i in range(len(y_train)):
    train_classes[y_train[i]] += 1

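Because each i in your loop is a label value between 0 and 9, y_train[i] only ever reads one of the first 10 entries of the array. Your current code therefore tallies the values of those first 10 labels over and over, so any digit that does not appear among them ends up with a count of 0.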
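To see the effect without downloading anything, here is a minimal sketch that runs both loops on synthetic labels. The array `labels` and the helpers `buggy_counts` and `correct_counts` are made up for illustration; they stand in for y_train and the two loop variants, and are not part of Keras.

```python
import numpy as np

# Hypothetical stand-in for y_train: random labels, NOT the real MNIST data.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)

def buggy_counts(y):
    """Reproduces the loop from the question."""
    counts = [0] * 10
    for i in y:             # i is a label value (0-9), not a position
        counts[y[i]] += 1   # so this only ever reads y[0] .. y[9]
    return counts

def correct_counts(y):
    """Counts each label once, using the loop variable directly."""
    counts = [0] * 10
    for label in y:
        counts[label] += 1
    return counts

buggy = buggy_counts(labels)
correct = correct_counts(labels)

# Values that occur among the first ten labels; only these can be counted by the buggy loop.
first_ten = set(labels[:10].tolist())

print("first ten labels:", labels[:10].tolist())
print("buggy counts:    ", buggy)
print("correct counts:  ", correct)
```

Both lists sum to the number of samples, which is why the buggy output looks plausible at first glance; only the zeros and the lopsided distribution give it away.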

Note: you can also simply use: np.unique(y_train, return_counts=True).
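As a rough sketch of that vectorized approach, the snippet below counts labels in a small hand-written array (a hypothetical sample, not loaded from MNIST). It also shows np.bincount, which is not mentioned above but may be handy because it reports a 0 for classes that never occur, whereas np.unique only lists the classes that are present.

```python
import numpy as np

# Hypothetical sample labels for illustration; substitute y_train or y_test.
y = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=np.uint8)

# np.unique returns the distinct labels and how often each one occurs.
classes, counts = np.unique(y, return_counts=True)
per_class = dict(zip(classes.tolist(), counts.tolist()))

# np.bincount returns one count per integer from 0 up to minlength - 1,
# including zeros for labels that are absent.
full_counts = np.bincount(y, minlength=10)

print(per_class)
print(full_counts.tolist())
```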

edited Oct 21 '18 at 23:51

answered Oct 21 '18 at 23:44
