
I am using Python and I have several directories of images I'd like to convert to .gz files so I can follow along with the Lasagne tutorial. The tutorial uses training images stored in a single .gz file. I'm trying to convert my directory of images into a .gz as well, so I can emulate the tutorial code and better understand it.

In particular, I'm trying to understand the format of the MNIST .gz files like train-images-idx3-ubyte.gz found at Dr. LeCun's website.

I am able to convert a single image to a .gz, but not a directory. My online searches suggest this is to be expected. How would I create a single .gz file containing the information for multiple training images?

Please let me know if you need more information or if I'm asking the wrong question or heading in an insensible direction. Thanks.


1 Answer


You cannot. gzip is a stream compression method, it is not a container. In this case, the images are stored in a file container, which is described at the bottom of the page:

The IDX file format is a simple format for vectors and multidimensional matrices of various numerical types. The basic format is:

magic number
size in dimension 0
size in dimension 1
size in dimension 2
.....
size in dimension N
data

The magic number is an integer (MSB first). The first 2 bytes are always 0.

The third byte codes the type of the data:

  • 0x08: unsigned byte
  • 0x09: signed byte
  • 0x0B: short (2 bytes)
  • 0x0C: int (4 bytes)
  • 0x0D: float (4 bytes)
  • 0x0E: double (8 bytes)

The 4th byte codes the number of dimensions of the vector/matrix: 1 for vectors, 2 for matrices, and so on.

The sizes in each dimension are 4-byte integers (MSB first, i.e. big-endian, like in most non-Intel processors).

The data is stored like in a C array, i.e. the index in the last dimension changes the fastest.
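If you do want to emulate the MNIST layout described above, the header and data can be written directly and then gzip-compressed. The following is a minimal sketch, not part of the original answer: the function names are illustrative, NumPy is assumed for holding the pixel data, and it only handles the idx3-ubyte case (unsigned bytes, 3 dimensions).

```python
import gzip
import struct

import numpy as np

def write_idx3_gz(images, path):
    """Write a (num, rows, cols) uint8 array in the IDX3 layout, gzipped."""
    images = np.ascontiguousarray(images, dtype=np.uint8)
    num, rows, cols = images.shape
    with gzip.open(path, "wb") as f:
        # Magic number: two zero bytes, 0x08 = unsigned byte, 3 dimensions.
        f.write(struct.pack(">BBBB", 0, 0, 0x08, 3))
        # Dimension sizes as big-endian (MSB first) 4-byte integers.
        f.write(struct.pack(">III", num, rows, cols))
        # Pixel data in C order: the last index changes fastest.
        f.write(images.tobytes())

def read_idx3_gz(path):
    """Read back a gzipped idx3-ubyte file into a (num, rows, cols) array."""
    with gzip.open(path, "rb") as f:
        _, _, dtype_code, ndim = struct.unpack(">BBBB", f.read(4))
        assert dtype_code == 0x08 and ndim == 3
        num, rows, cols = struct.unpack(">III", f.read(12))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(num, rows, cols)
```

A file produced this way follows the same byte layout as train-images-idx3-ubyte.gz, so a loader that parses that header should accept it.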

A more typical approach is to use a tar archive as the container and then compress the archive with gzip. The benefit is that this is the standard way of creating gzip-compressed archives, and it does not require a custom script to extract the files.

An example of how to do this with a given directory of images is as follows (using Bash, on a *nix system):

tar -zcvf tar-archive-name.tar.gz source-folder-name

Gzip compression is built in via the -z flag, or you can create the archive first and run the gzip command on it yourself.
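For reference, the two-step equivalent of `tar -zcvf` looks like this (the archive and folder names are illustrative):

```shell
# Build the uncompressed tar archive first...
tar -cvf tar-archive-name.tar source-folder-name
# ...then compress it; gzip replaces the file with tar-archive-name.tar.gz
gzip tar-archive-name.tar
```

The result is byte-for-byte the same kind of file as the one-liner produces.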

In Python, you can create the same kind of gzip-compressed tar archive with the tarfile module. A simple example, adapted from the documentation, is as follows:

import tarfile

# "w:gz" writes a gzip-compressed archive, so name the file .tar.gz to match
tar = tarfile.open("sample.tar.gz", "w:gz")
for name in ["foo", "bar", "quux"]:
    tar.add(name)  # each name may be a file or a whole directory
tar.close()

The mode 'w:gz' specifies that the archive will be gzip compressed, and this will work on any operating system.
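For completeness, reading such an archive back uses mode "r:gz". A small self-contained round trip (the file names here are illustrative):

```python
import tarfile

# Create a file to archive.
with open("foo.txt", "w") as f:
    f.write("hello")

# Write a gzip-compressed tar archive containing it.
with tarfile.open("sample.tar.gz", "w:gz") as tar:
    tar.add("foo.txt")

# Read the archive back and extract its members.
with tarfile.open("sample.tar.gz", "r:gz") as tar:
    names = tar.getnames()  # ['foo.txt']
    tar.extractall(path="extracted")
```

Note that tarfile transparently handles the gzip layer, so no separate call to the gzip module is needed.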

Alex Huszagh
  • Thanks for your answer. I tried the above and it ends up not working for my intended use with pickle.load. But it led me to another answer: http://stackoverflow.com/questions/26898682/how-to-create-a-file-like-the-mnist-dataset. My question is the same as that person's, except: is there a way to do the same thing that Lush (which I don't know) does, but in either R or Python? I can phrase this as a new question if needed. – user2205916 Apr 27 '16 at 01:10
  • It's possible, but is there any reason why you want the above format? That format is not standardized, and a tar archive with gzip compression, or a zip archive, is much more compatible with other systems. If you need massive compression, maybe check out PNGOUT? PNG already has lossless compression built in, so using gzip compression on PNG files is unlikely to do much. If you need the above format, I can look into helping you. – Alex Huszagh Apr 27 '16 at 01:16
  • I'm trying to train my own convolutional neural network with my own set of images (15 GB). I wanted to start by working through the MNIST data set at https://github.com/Lasagne/Lasagne/blob/master/examples/mnist.py. The script there deals with MNIST data in the .gz format. While your suggestion helped me successfully convert my directory to a .tar.gz, the pickle.load command on it, as written in the example script, failed. – user2205916 Apr 27 '16 at 01:21