I'm training a ConvNet using Keras and Theano, but before doing that I decided to take a peek into the dataset, its data samples and classes... And I don't like what I'm seeing.
I'm using the following code to load both the training and test datasets and count how many samples carry each class label:
import numpy as np
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
train_classes = [0,0,0,0,0,0,0,0,0,0]
test_classes = [0,0,0,0,0,0,0,0,0,0]
for i in y_train:
    train_classes[y_train[i]] = train_classes[y_train[i]] + 1
for i in y_test:
    test_classes[y_test[i]] = test_classes[y_test[i]] + 1
print('Training classes: ', train_classes)
print('\nTesting classes: ', test_classes)
... And the results are worrying:
(ann) C:\Users\shado\mnist>python statistics.py
Using Theano backend.
Training classes: [6742, 17900, 5421, 6265, 11907, 5923, 0, 0, 0, 5842]
Testing classes: [1010, 1924, 1135, 0, 1940, 974, 0, 980, 0, 2037]
As you can see from the label counts, the training dataset is missing the '6', '7' and '8' classes, while the testing dataset is missing the '3', '6' and '8' classes. And of course, the class distribution is all over the place, especially in the training dataset.
Am I downloading the wrong dataset? Am I missing something here?
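For comparison, here is a minimal sketch of a cross-check I could run: it tallies each label by its value using `collections.Counter` from the standard library. The `labels` list is a small hypothetical stand-in, not the MNIST data, and `count_labels` is a name made up just for this illustration.

```python
from collections import Counter

def count_labels(labels, num_classes=10):
    """Return a list where position k holds how many times label k occurs."""
    # Tally the label values themselves; int() also handles NumPy integer types.
    counts = Counter(int(label) for label in labels)
    # Report every class, including those that never occur (count 0).
    return [counts.get(k, 0) for k in range(num_classes)]

# Hypothetical stand-in labels (assumption: not the real MNIST labels).
labels = [5, 0, 4, 1, 9, 2, 1, 3, 1, 4]
print(count_labels(labels))  # prints [1, 3, 1, 1, 2, 1, 0, 0, 0, 1]
```

With the real data this would presumably be called as `count_labels(y_train)` and `count_labels(y_test)`, and the counts should add up to the number of samples in each split.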