loading EMNIST-letters dataset

Question

I have been trying to find a way to load the EMNIST-letters dataset but without much success. I have found interesting stuff in the structure and can't wrap my head around what is happening. Here is what I mean:

I downloaded the dataset in .mat (MATLAB) format.

I can load the data using

import scipy.io
mat = scipy.io.loadmat('letter_data.mat')  # renamed for convenience

It is a dictionary with the following keys:

dict_keys(['__header__', '__version__', '__globals__', 'dataset'])

The only key of interest is dataset, which I haven't been able to extract data from. Printing its shape gives this:

>>> print(mat['dataset'].shape)
(1, 1)

I dug deeper and deeper to find a shape that looks somewhat like a real dataset and came across this:

>>> print(mat['dataset'][0][0][0][0][0][0].shape)
(124800, 784)

This is exactly what I wanted, but I can't find the labels or the test data. I tried many things but can't seem to understand the structure of this dataset.

If someone could tell me what is going on here, I would appreciate it.

I suggest you run it on Spyder and see it in Variable Explorer. — PyMatFlow, Jul 01 '18 at 19:08
It doesn't seem to work even there; I can't explore the variable. — Tissuebox, Jul 01 '18 at 19:22

Josh Payne · Accepted Answer · 2018-07-03T00:52:35.733

Because of the way the dataset is structured, the array of image arrays can be accessed with mat['dataset'][0][0][0][0][0][0] and the array of label arrays with mat['dataset'][0][0][0][0][0][1]. For instance, print(mat['dataset'][0][0][0][0][0][0][0]) will print out the pixel values of the first image, and print(mat['dataset'][0][0][0][0][0][1][0]) will print the first image's label.

For a less...convoluted dataset, I'd actually recommend using the CSV version of the EMNIST dataset on Kaggle: https://www.kaggle.com/crawford/emnist. Each row is a separate image with 785 columns: the first column is the class label, and each remaining column is one pixel value (784 in total for a 28 x 28 image).
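
As a rough sketch of that row layout (label first, then the pixel values), the rows can be split using only the standard library. The helper name `parse_emnist_csv` and the two-row in-memory sample are made up for illustration, and the code assumes the file has no header row, which is worth checking on the actual download.

```python
import csv
import io


def parse_emnist_csv(lines, n_pixels=784):
    """Split label-first CSV rows into (labels, images).

    `lines` is any iterable of text lines, e.g. an open file object.
    Assumes no header row and one image per row (hypothetical helper).
    """
    labels, images = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        values = [int(v) for v in row]
        if len(values) != n_pixels + 1:
            raise ValueError(f"expected {n_pixels + 1} columns, got {len(values)}")
        labels.append(values[0])   # first column: class label
        images.append(values[1:])  # remaining columns: pixel values
    return labels, images


# Tiny synthetic stand-in for the real file: two rows, 4 pixels each.
sample = io.StringIO("1,0,0,255,0\n26,9,8,7,6\n")
labels, images = parse_emnist_csv(sample, n_pixels=4)
print(labels)     # [1, 26]
print(images[0])  # [0, 0, 255, 0]
```

For the real file, pass an open file object (opened with `newline=''`) and keep the default `n_pixels=784`.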

score 5 · Answer 2 · edited Jun 17 '20 at 09:47

@Josh Payne's answer is correct, but I'll expand on it for those who want to use the .mat file with an emphasis on typical data splits.

The data has already been split into a training set and a test set. Here's how I accessed it:

    from scipy import io as sio
    mat = sio.loadmat('emnist-letters.mat')
    data = mat['dataset']

    X_train = data['train'][0,0]['images'][0,0]
    y_train = data['train'][0,0]['labels'][0,0]
    X_test = data['test'][0,0]['images'][0,0]
    y_test = data['test'][0,0]['labels'][0,0]

There is an additional field 'writers' (e.g. data['train'][0,0]['writers'][0,0]) that distinguishes the original sample writer. Finally, there is another field data['mapping'], but I'm not sure what it is mapping the digits to.
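
On the mapping field: EMNIST is also distributed with plain-text mapping files that pair each class label with ASCII code points, and the mapping array likely holds the same table, though that is worth confirming by printing it. Assuming the letters split uses labels 1 through 26 for a through z (an assumption, not something stated above), a label can be turned back into a character roughly like this. `label_to_letter` is a hypothetical helper.

```python
def label_to_letter(label):
    """Map an EMNIST-letters label to a lowercase letter.

    Assumes labels run 1..26 for 'a'..'z' (1-indexed); verify against
    the dataset's own mapping table before relying on this.
    """
    label = int(label)
    if not 1 <= label <= 26:
        raise ValueError(f"label out of range: {label}")
    return chr(ord("a") + label - 1)


print(label_to_letter(1))   # a
print(label_to_letter(26))  # z
```

If the labels load as a column array, pass a scalar element such as `y_train[0, 0]` rather than a whole row.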

In addition, in Section II D, the EMNIST paper states that "the last portion of the training set, equal in size to the testing set, is set aside as a validation set". Strangely, the training/testing sizes in the .mat file do not match the numbers listed in Table II, but they do match the sizes in Fig. 2.

    val_start = X_train.shape[0] - X_test.shape[0]
    X_val = X_train[val_start:X_train.shape[0],:]
    y_val = y_train[val_start:X_train.shape[0]]
    X_train = X_train[0:val_start,:]
    y_train = y_train[0:val_start]

If you don't want a validation set it is fine to leave these samples in the training set.

Also, if you would like to reshape the data into 2D 28x28 images instead of 1D 784-element arrays, you'll need a NumPy reshape with Fortran ordering to get the correct image orientation (MATLAB uses column-major ordering, just like Fortran). For example:

    X_train = X_train.reshape( (X_train.shape[0], 28, 28), order='F')
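
To see what order='F' changes, here is a dependency-free sketch on a toy 2 x 3 image. In column-major storage the flat vector walks down each column first, so pixel (row, col) sits at index `col * height + row`. The function name `unflatten_column_major` is made up for this illustration.

```python
def unflatten_column_major(flat, height, width):
    """Rebuild a height x width image from a column-major flat list."""
    if len(flat) != height * width:
        raise ValueError("flat length does not match height * width")
    return [
        [flat[col * height + row] for col in range(width)]
        for row in range(height)
    ]


# Toy image with rows [1, 2, 3] and [4, 5, 6], stored column by column.
flat = [1, 4, 2, 5, 3, 6]
print(unflatten_column_major(flat, height=2, width=3))
# [[1, 2, 3], [4, 5, 6]]
```

Reading the same list row by row would instead give [[1, 4, 2], [5, 3, 6]]. For a square image that row-by-row reading is the transpose of the intended picture, which is why a default reshape shows the letters in the wrong orientation.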

score 5 · Answer 3 · answered Jan 09 '20 at 06:09

An alternative solution is to use the EMNIST python package. (Full details at https://pypi.org/project/emnist/)

This lets you run pip install emnist in your environment and then import the datasets (they are downloaded the first time you run the program).

Example from the site:

  >>> from emnist import extract_training_samples
  >>> images, labels = extract_training_samples('digits')
  >>> images.shape
  (240000, 28, 28)
  >>> labels.shape
  (240000,)

You can also list the datasets:

  >>> from emnist import list_datasets
  >>> list_datasets()
  ['balanced', 'byclass', 'bymerge', 'digits', 'letters', 'mnist']

Replace 'digits' in the first example with the dataset of your choice.

This gives you all the data in numpy arrays which I have found makes things easy to work with.

Marco Cerliani · Answer 4 · 2020-08-10T12:52:17.997

I suggest downloading the 'Binary format as the original MNIST dataset' archive from the EMNIST download page on the NIST website.

Unzip the downloaded file, then in Python:

import idx2numpy

X_train = idx2numpy.convert_from_file('./emnist-letters-train-images-idx3-ubyte')
y_train = idx2numpy.convert_from_file('./emnist-letters-train-labels-idx1-ubyte')

X_test = idx2numpy.convert_from_file('./emnist-letters-test-images-idx3-ubyte')
y_test = idx2numpy.convert_from_file('./emnist-letters-test-labels-idx1-ubyte')
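
If installing idx2numpy is not an option, the IDX layout is simple enough to read by hand: two zero bytes, one type-code byte (0x08 for unsigned bytes), one byte giving the number of dimensions, then each dimension as a 4-byte big-endian integer, followed by the raw data. The sketch below handles only the unsigned-byte case and is exercised on synthetic bytes rather than the real files; `parse_idx_ubyte` is a hypothetical name.

```python
import struct


def parse_idx_ubyte(raw):
    """Parse IDX-formatted bytes holding unsigned 8-bit data.

    Returns (dims, data): dims is a tuple of dimension sizes and
    data is a bytes object with the flattened values.
    """
    zero, dtype_code, ndim = struct.unpack_from(">HBB", raw, 0)
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX payload")
    dims = struct.unpack_from(f">{ndim}I", raw, 4)
    offset = 4 + 4 * ndim  # 4-byte magic number plus one uint32 per dimension
    count = 1
    for d in dims:
        count *= d
    data = raw[offset:offset + count]
    if len(data) != count:
        raise ValueError("payload shorter than the header promises")
    return dims, data


# Synthetic example: 2 "images" of 2 x 2 pixels.
header = struct.pack(">HBB3I", 0, 0x08, 3, 2, 2, 2)
dims, data = parse_idx_ubyte(header + bytes(range(8)))
print(dims)        # (2, 2, 2)
print(list(data))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

For a real file, read the decompressed contents with `open(path, 'rb').read()`. Image files carry three dimensions (count, rows, columns) and label files carry one.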
