Using the program at https://leon.bottou.org/projects/infimnist, I generated some data.

As far as I can tell, it is in some sort of binary format:

b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05 ...

I need to extract labels and pictures from two datasets like this, generated with:

https://leon.bottou.org/projects/infimnist

with open("test10k-labels", "rb") as binary_file:
    data = binary_file.read()
    print(data)

>>> b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05 ...

b"\x00\x00\x08\x01 ...".decode('ascii')

>>> "\x00\x00\x08\x01 ..."

I also tried the binascii package, but it did not work.

Thankful for any help!

Creating the Data

To create the dataset I am talking about, download the package from https://leon.bottou.org/projects/infimnist and build it:

$ cd dir_of_folder
$ make

Then I took the path of the resulting infimnist executable and ran:

$ app_path lab 10000 69999 > mnist60k-labels-idx1-ubyte

This should place the file I used in the folder.

The command after app_path can be replaced by any other command the author lists on the site.

Final update

It works! Using some numpy functions the images can be returned to their normal orientation.

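import struct
from array import array
import numpy as np

# for the labels
with open(path, "rb") as binary_file:
    binary_file.read(8)  # skip the 8-byte header (magic number, item count)
    y_train = np.array(array("B", binary_file.read()))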

# for the images
with open("images path", "rb") as binary_file:
    images = []
    emnistRotate = True
    magic, size, rows, cols = struct.unpack(">IIII", binary_file.read(16))
    if magic != 2051:
        raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
    for i in range(size):
        images.append([0] * rows * cols)
    image_data = array("B", binary_file.read())
    for i in range(size):
        images[i][:] = image_data[i * rows * cols:(i + 1) * rows * cols]

        # for some reason EMNIST is mirrored and rotated
        if emnistRotate:
            x = image_data[i * rows * cols:(i + 1) * rows * cols]

            subs = []
            for r in range(rows):
                subs.append(x[(rows - r) * cols - cols:(rows - r)*cols])

            l = list(zip(*reversed(subs)))
            fixed = [item for sublist in l for item in sublist]
            images[i][:] = fixed
x = []
for image in images:
    x.append(np.rot90(np.flip(np.array(image).reshape((28,28)), 1), 1))
x_train = np.array(x)

Crazy solution for such a simple thing :)
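
A side note on the code above: the emnistRotate branch appears to transpose each image, and np.rot90(np.flip(..., 1), 1) transposes it back, so the two steps seem to cancel each other. The small check below uses a made-up 3x3 array (not real infimnist data, and the names original, fixed and restored are only illustrative) to show this. If it holds for the real files too, a plain row-major reshape should be enough.

```python
import numpy as np

# Made-up 3x3 "image"; real infimnist images are 28x28.
rows, cols = 3, 3
original = np.arange(rows * cols).reshape(rows, cols)
flat = original.flatten().tolist()

# Same steps as the emnistRotate branch: collect the rows bottom-up,
# reverse them again, then zip them, which amounts to a transpose.
subs = [flat[(rows - r) * cols - cols:(rows - r) * cols] for r in range(rows)]
fixed = [item for column in zip(*reversed(subs)) for item in column]

# Same numpy step as in the final loop: mirror, then rotate by 90 degrees.
restored = np.rot90(np.flip(np.array(fixed).reshape(rows, cols), 1), 1)

print(np.array_equal(np.array(fixed).reshape(rows, cols), original.T))  # True
print(np.array_equal(restored, original))                               # True
```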

1 Answer

Ok, so looking at the python-mnist source, it seems the correct way to unpack the binary format is as follows:

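import struct
from array import array

with open("test10k-labels", "rb") as binary_file:
    # header: magic number and item count, both big-endian 32-bit integers
    magic, size = struct.unpack(">II", binary_file.read(8))
    if magic != 2049:
        raise ValueError("Magic number mismatch, expected 2049, got {}".format(magic))
    # the remaining bytes are the labels, one unsigned byte each
    labels = array("B", binary_file.read())
    print(labels)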
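
To make the header layout concrete, here is a minimal sketch that decodes only the bytes printed in the question. It assumes the file follows the IDX layout of the original MNIST label files, which the magic number 2049 suggests. The variable names are illustrative.

```python
import struct

# The first bytes printed in the question (the output is truncated there,
# so only this visible part is used).
raw = b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05"

# Header: two big-endian unsigned 32-bit integers (magic number, item count).
magic, count = struct.unpack(">II", raw[:8])
print(magic, count)    # 2049 10000

# Every byte after the header is one label.
first_labels = list(raw[8:])
print(first_labels)    # [7, 2, 1, 0, 4, 1, 4, 9, 5]
```

So the bytes that look like noise in the printed output are the header, and the labels start at byte 8. Decoding as ASCII cannot help, because the file contains raw integers rather than text.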

Update

I haven't tested this extensively, but the following code should work. It was adapted from the aforementioned python-mnist source:

from array import array
import struct
with open("mnist8m-patterns-idx3-ubyte", "rb") as binary_file:
    images = []
    emnistRotate = True
    magic, size, rows, cols = struct.unpack(">IIII", binary_file.read(16))
    if magic != 2051:
        raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
    for i in range(size):
        images.append([0] * rows * cols)
    image_data = array("B", binary_file.read())
    for i in range(size):
        images[i][:] = image_data[i * rows * cols:(i + 1) * rows * cols]

        # for some reason EMNIST is mirrored and rotated
        if emnistRotate:
            x = image_data[i * rows * cols:(i + 1) * rows * cols]

            subs = []
            for r in range(rows):
                subs.append(x[(rows - r) * cols - cols:(rows - r)*cols])

            l = list(zip(*reversed(subs)))
            fixed = [item for sublist in l for item in sublist]
            images[i][:] = fixed
    print(images)
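
If numpy is available, the same parsing can probably be written more compactly. The sketch below is not taken from python-mnist. The function name parse_idx_images is hypothetical, and it applies no rotation, on the assumption that the data is stored row by row. It is exercised here on synthetic bytes rather than on a real infimnist file.

```python
import struct
import numpy as np

def parse_idx_images(data: bytes) -> np.ndarray:
    """Parse IDX3 image bytes into an array of shape (count, rows, cols)."""
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("Magic number mismatch, expected 2051, got {}".format(magic))
    # Everything after the 16-byte header is pixel data, one byte per pixel.
    pixels = np.frombuffer(data, dtype=np.uint8, offset=16)
    if pixels.size != count * rows * cols:
        raise ValueError("Header announces {} pixels, but {} were found".format(
            count * rows * cols, pixels.size))
    return pixels.reshape(count, rows, cols)

# Synthetic example: two 2x3 "images" (not real infimnist output).
sample = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
images = parse_idx_images(sample)
print(images.shape)        # (2, 2, 3)
print(images[1].tolist())  # [[6, 7, 8], [9, 10, 11]]
```

For a real file, the bytes would come from open(..., "rb").read(). Whether an extra rotation is needed depends on the dataset, so it is worth plotting one image before relying on the result.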

Previous answer:

You can use the python-mnist library:

from mnist import MNIST
mndata = MNIST('./data')
images, labels = mndata.load_training()
fluxens
  • I know, but the link I provided extends the MNIST dataset to up to 8 million images by morphing the original set. It's called the mnist8m dataset: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html. Any advice on how to solve the problem? –  Jun 03 '19 at 16:00
  • So, looking at the mnist8m dataset from http://csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html it does not look binary encoded. – fluxens Jun 03 '19 at 16:37
  • No, but the program at the link I provided in the original question lets you freely generate digits, though only with binary output (as you can see in my code). Now I am interested in how to decode such a file format. –  Jun 03 '19 at 16:40
  • Could you please update your question with the steps you took to create the data? – fluxens Jun 03 '19 at 16:43
  • It works! I posted a final update. As you said, I only had to rotate and mirror the images using some numpy functions. Thanks again! –  Jun 03 '19 at 20:10