15

I'm currently working on a case study for which I need to use the MNIST database.
The files on the MNIST site are said to be in IDX file format. I tried to look at them with basic text editors such as Notepad and WordPad, but had no luck.
Expecting them to be in the high endian format, I tried the following:

to.read = file("t10k-images.idx3-ubyte", "rb")
readBin(to.read, integer(), n=100, endian = "high")

I got some numbers as output, but none of them made any sense to me.

Can anyone please explain how to read the MNIST database files in R and how to interpret those numbers? Thanks.

Spacedman
StrikeR

5 Answers

23

endian="big", not "high":

> to.read = file("~/Downloads/t10k-images-idx3-ubyte", "rb")

magic number:

> readBin(to.read, integer(), n=1, endian="big")
[1] 2051

number of images:

> readBin(to.read, integer(), n=1, endian="big")
[1] 10000

number of rows:

> readBin(to.read, integer(), n=1, endian="big")
[1] 28

number of columns:

> readBin(to.read, integer(), n=1, endian="big")
[1] 28

here comes the data:

> readBin(to.read, integer(), n=1, endian="big")
[1] 0
> readBin(to.read, integer(), n=1, endian="big")
[1] 0

as per the training set image data description on the web site.

Now you just need to loop and read 28*28 byte chunks into matrices.

Start again:

 > to.read = file("~/Downloads/t10k-images-idx3-ubyte", "rb")

skip header:

> readBin(to.read, integer(), n=4, endian="big")
[1]  2051 10000    28    28

should really get the 28,28 from the header read but hard-coded here:

 > m = matrix(readBin(to.read,integer(), size=1, n=28*28, endian="big"),28,28)
 > image(m)

You might need to transpose or flip the matrix; I think it's an upside-down "7".

par(mfrow=c(5,5))
par(mar=c(0,0,0,0))
for(i in 1:25){
  m = matrix(readBin(to.read,integer(), size=1, n=28*28, endian="big"),28,28)
  image(m[,28:1])
}

gets you:

[Image: a 5 x 5 grid of the plotted digit images]

Oh, and google leads me to: http://www.inside-r.org/packages/cran/darch/docs/readMNIST which might be useful.
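The steps above can be gathered into one function. This is only a minimal sketch, not the answerer's code: the name read_idx_images is hypothetical, it takes the dimensions from the header instead of hard-coding 28, and it passes signed = FALSE so that pixel bytes come back in the range 0 to 255. To keep the example runnable without a download, it is exercised on a tiny synthetic file written in the same layout (magic number 2051, image count, rows, columns, then one byte per pixel).

```r
# Hypothetical helper: read every image from an IDX3 image file.
read_idx_images <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Header: four big-endian 32-bit integers.
  header <- readBin(con, integer(), n = 4, size = 4, endian = "big")
  stopifnot(header[1] == 2051)  # magic number for image files
  n <- header[2]; rows <- header[3]; cols <- header[4]
  # Pixels are single unsigned bytes, so endianness does not matter here.
  pixels <- readBin(con, integer(), n = n * rows * cols,
                    size = 1, signed = FALSE)
  # The file stores each image row by row (column index varies fastest),
  # while R fills arrays first-index-fastest, so fill as [col, row, image]
  # and then permute to [image, row, col].
  aperm(array(pixels, dim = c(cols, rows, n)), c(3, 2, 1))
}

# Synthetic test file (an assumption for illustration): 2 images of 2 x 3.
tmp <- tempfile()
out <- file(tmp, "wb")
writeBin(c(2051L, 2L, 2L, 3L), out, size = 4, endian = "big")
writeBin(as.raw(c(0, 128, 255, 1, 2, 3, 10, 20, 30, 40, 50, 60)), out)
close(out)

imgs <- read_idx_images(tmp)
dim(imgs)      # 2 2 3
imgs[1, 1, ]   # 0 128 255
```

With the real file, something like imgs <- read_idx_images("~/Downloads/t10k-images-idx3-ubyte") should give a 10000 x 28 x 28 array, assuming the file has already been decompressed.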

Spacedman
  • Wow! Thanks a lot for the descriptive answer. – StrikeR Feb 03 '14 at 13:00
  • 1
    I was just going to write 'endian="big" not "high"' but got a bit carried away. You might also be able to read them using the raster package... – Spacedman Feb 03 '14 at 15:46
  • Thanks for the answer but I have a question. This code line doesn't work for me : " par(mar=c(0,0)) ". It gives this error: "Error in par(mar = c(0, 0)) : graphical parameter "mar" has the wrong length" – merve bıçakçı Apr 05 '16 at 09:27
  • 1
    Try `par(mar=c(0,0,0,0))` instead. I wonder if R used to repeat the `mar` vector until it was length=4 and something has changed... – Spacedman Apr 05 '16 at 11:54
  • 2
    @Spacedman - I tried using the code that you mentioned, but the images that show up in my plot window do not make any sense; they are all jumbled up. Even the magic number and all the other values that you mentioned are very different for me. For example, readBin(to.read, integer(), n=4, endian="big") gives [1] 1195909779 -2010897709 -580149833 -809942066. Is that supposed to happen? – Aayush Agrawal Jan 05 '17 at 10:14
  • @Spacedman - You need to set "signed = F" when you call readBin for the image file so that you get a number in the range of 0 to 255. Your code currently interprets some of the bytes as negative. – Chechy Levas Jun 01 '17 at 07:54
  • I don't understand... from "magic number" through "number of columns" the commands are identical. For me they all return 0. Additionally, I am having the same issue as @AayushAgrawal. I can plot the images but they are all offset, centered in the space between the numbers so the numbers are cut off at the left and right borders. – syntonicC Apr 03 '18 at 18:23
  • I get `Error in par(mar = c(0, 0)) : graphical parameter "mar" has the wrong length`. According to http://rfunction.com/archives/1302 it should have a length of 4. `mar=c(3.5, 3.5, 2, 1)` gave a good result. Maybe this is what you mean? – Z boson May 22 '19 at 08:10
6

The MNIST dataset is also available in the keras package.

library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
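To look at one of these digits with image(), the matrix usually has to be reoriented, because image() draws the first matrix index along the x axis and the second index upward along the y axis. The sketch below assumes the arrays are indexed as [image, row, column] (so x_train[1, , ] would be one 28 x 28 digit); the helper name orient_for_image is hypothetical, and a small made-up matrix stands in for a real digit so the example runs without keras.

```r
# Hypothetical helper: turn a matrix whose rows run top-to-bottom into the
# layout image() expects (x along the first index, y upward along the second).
orient_for_image <- function(img) {
  t(img[nrow(img):1, , drop = FALSE])
}

# Stand-in for a digit: a 2 x 3 "picture" with rows 1 2 3 / 4 5 6.
img <- matrix(1:6, nrow = 2, byrow = TRUE)
z <- orient_for_image(img)
dim(z)    # 3 2
z[1, 2]   # 1, the top-left pixel ends up at x = 1, highest y

# With the keras data this would be used as, for example:
# image(orient_for_image(x_train[1, , ]), col = gray.colors(256))
```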
Imran Kocabiyik
3

Following up on the darch (not ~Darch~) package mentioned above:

The package is called darch. It has been moved to MRAN (Microsoft R Application Network) but is available on CRAN as well.

It provides two functions for the MNIST data:

readMNIST, which reads the ubyte files stored on your hard drive and saves them as test.Rdata and train.Rdata archives.

provideMNIST which will download the files and call readMNIST on them.

When calling these functions you need to give the directory path with single forward slashes as separators, e.g. readMNIST("../MNIST/") (the trailing slash is required).

If you download the files yourself you will need to change the file names: the gz archives contain files with extensions, like t10k-labels.idx1-ubyte but readMNIST looks for files without extension, like t10k-labels-idx1-ubyte, so you have to change the dot to a dash (with darch version 0.12.0, maybe they'll fix this).
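The renaming step can be scripted rather than done by hand. This is a rough sketch under the assumption that the extracted files end in .idx1-ubyte or .idx3-ubyte, as described above; the function name fix_mnist_names is hypothetical, and the demo runs against empty placeholder files in a temporary directory, not the real data.

```r
# Hypothetical helper: rename "name.idx3-ubyte" to "name-idx3-ubyte"
# for every matching file in a directory.
fix_mnist_names <- function(dir) {
  old <- list.files(dir, pattern = "\\.idx[13]-ubyte$", full.names = TRUE)
  new <- file.path(dirname(old), sub("\\.idx", "-idx", basename(old)))
  file.rename(old, new)
  invisible(new)
}

# Demo on empty placeholder files in a fresh temporary directory.
demo_dir <- tempfile()
dir.create(demo_dir)
file.create(file.path(demo_dir, c("t10k-labels.idx1-ubyte",
                                  "t10k-images.idx3-ubyte")))
fix_mnist_names(demo_dir)
sort(list.files(demo_dir))
# "t10k-images-idx3-ubyte" "t10k-labels-idx1-ubyte"
```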

To load the files in R, use the load function (e.g. load("..\\MNIST\\test.Rdata")). This will create the matrices trainData and testData in the environment.

For some reason I did not get any dimnames for the matrices.

ubomb
Marco Stamazza
2

Here's how you can do it using the darch package:

Run readMNIST('C:/Users/pj_/Dir/')

This stores test.RData and train.RData in the directory you passed. When you load these two files into your workspace, you will see 'testData', 'testLabels', 'trainData' and 'trainLabels' in your global environment.

Pj_
2

I tried the above, using:

data <- readBin(to.read, integer(), size = 1, n = 784, endian="big")

but ended up with both positive and negative integers in the image. Consequently, when plotted, using:

plot(as.cimg(data))

I get a grey background with the character in pixels that are darker or lighter than the background.

I then used (see https://tensorflow.rstudio.com/tfestimators/articles/examples/mnist.html):

data <- readBin(to.read, what = "raw", n = 784, endian="big")
conv <- as.integer(data)
mm <- matrix(conv, 28, 28)

Now I have only positive values (0 to 255), and the plot gives a proper white character on a black background. Which is what I wanted.
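The difference comes down to how each byte is interpreted. As a small illustration (the four byte values are made up for the example), readBin can read straight from a raw vector, which makes it easy to compare the default signed read with signed = FALSE and with the as.integer conversion used above:

```r
# Four sample bytes: 0, 127, 128 and 255.
bytes <- as.raw(c(0x00, 0x7f, 0x80, 0xff))

# Default for size = 1 is signed: bytes above 127 wrap to negative values.
signed_vals <- readBin(bytes, integer(), n = 4, size = 1)
signed_vals     # 0 127 -128 -1

# signed = FALSE keeps every byte in 0..255.
unsigned_vals <- readBin(bytes, integer(), n = 4, size = 1, signed = FALSE)
unsigned_vals   # 0 127 128 255

# Converting raw bytes with as.integer gives the same unsigned values.
from_raw <- as.integer(bytes)
from_raw        # 0 127 128 255
```

Either unsigned route should produce the same pixel matrix; reading as raw and converting is simply the form used in this answer.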

Ben