How to create dataset similar to cifar-10

Question

I want to create a dataset in the same format as the CIFAR-10 data set, to use with TensorFlow. It should have images and labels. I'd like to take the CIFAR-10 code, swap in different images and labels, and run it.

score 25 · Accepted Answer · edited Jan 23 '21 at 00:46

First we need to understand the format of the CIFAR10 data set. If we refer to https://www.cs.toronto.edu/~kriz/cifar.html, specifically the Binary Version section, we see:

the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
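
To make that layout concrete, here is a minimal sketch that decodes one 3073-byte record with NumPy. The name `decode_record` is made up for illustration; it is not part of the CIFAR10 or TensorFlow code.

```python
import numpy as np

RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes


def decode_record(record):
    """Split one CIFAR-10 binary record into (label, image).

    `record` is a bytes object of length 3073. The returned image has
    shape (32, 32, 3), i.e. row x column x channel.
    """
    raw = np.frombuffer(record, dtype=np.uint8)
    if raw.size != RECORD_BYTES:
        raise ValueError("expected %d bytes, got %d" % (RECORD_BYTES, raw.size))
    label = int(raw[0])
    # Pixels are stored plane by plane (R, G, B), each plane in row-major order.
    planes = raw[1:].reshape(3, 32, 32)
    image = planes.transpose(1, 2, 0)  # -> (row, column, channel)
    return label, image
```

Decoding a record from your own file this way and displaying `image` is a quick check that the channels and rows ended up where you expect.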

Intuitively, we need to store the data in this format. As a baseline experiment, first get images that have exactly the same size and number of classes as CIFAR10 and put them in this format. This means your images should be 32x32x3 and fall into 10 classes. If that runs successfully, you can go on to handle cases like single-channel images, different input sizes, and a different number of classes. Doing so means changing many variables in other parts of the code, so work your way through it slowly.

I'm in the midst of working out a general module. My code for this is at https://github.com/jkschin/svhn. If you refer to the svhn_flags.py code, you will see many flags there that can be changed to accommodate your needs. I admit it's cryptic for now, as I haven't cleaned it up to be readable, but it works. If you are willing to spend some time taking a rough look, you will figure something out.

This is probably the easiest way to run your own data set with the CIFAR10 code. You could of course just copy the neural network definition and implement your own reader, input format, batching, etc., but if you want it up and running fast, just shape your inputs to fit CIFAR10.

EDIT:

Some really basic code that I hope will help.

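from PIL import Image
import numpy as np

# Load the image and force 3 channels so the slicing below always works.
im = Image.open('images.jpeg').convert('RGB')
im = np.array(im)  # shape: (height, width, 3), dtype uint8

# Flatten each colour plane in row-major order.
r = im[:, :, 0].flatten()
g = im[:, :, 1].flatten()
b = im[:, :, 2].flatten()
label = [1]  # integer class label for this image

# One record: the label byte followed by the R, G and B planes.
out = np.array(list(label) + list(r) + list(g) + list(b), np.uint8)
out.tofile("out.bin")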

This converts an image into a byte file that is ready for use with the CIFAR10 code. For multiple images, just keep concatenating the arrays, as stated in the format above. To check that your format is correct, specifically for the Asker's use case with 427x427 images, you should get a file size of 427*427*3 + 1 = 546988 bytes per image, assuming your pictures are RGB with values in the range 0-255. Once you verify that, you're all set to run in TensorFlow. Do use TensorBoard to visualize at least one image, just to confirm correctness.
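
For several images, "keep concatenating" could look like the sketch below. It is only an illustration under the same assumptions (RGB, values 0-255, integer labels); `to_record` and `write_records` are hypothetical helper names, and the images are passed in as arrays that have already been loaded.

```python
import numpy as np


def to_record(label, rgb):
    """Build one CIFAR-10 style record from an integer label and an
    (H, W, 3) uint8 array: label byte, then the R, G and B planes."""
    rgb = np.asarray(rgb, dtype=np.uint8)
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) RGB array")
    # Channel axis first, then flatten: all of R, then G, then B,
    # each plane in row-major order.
    planes = rgb.transpose(2, 0, 1).ravel()
    return np.concatenate((np.array([label], dtype=np.uint8), planes))


def write_records(path, labels, images):
    """Write all records back to back into a single .bin file and
    return how many were written."""
    records = [to_record(lab, img) for lab, img in zip(labels, images)]
    np.concatenate(records).tofile(path)
    return len(records)
```

Each array can come from PIL as in the snippet above, for example `np.array(Image.open(path).convert('RGB').resize((32, 32)))` if you want 32x32 inputs. The resulting file size should be the number of images times (height * width * 3 + 1).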

EDIT 2:

As per Asker's question in comments,

if not eval_data:
    filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i)
                 for i in xrange(1, 6)]

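If you really want it to work as it is, you need to study the function calls in the CIFAR10 code. In cifar10_input, the batch file names are hardcoded, so you have to edit this line to match the name of your bin file. Or just distribute your images evenly across bin files with the names the code expects: data_batch_1.bin to data_batch_5.bin for training, plus test_batch.bin for evaluation, 6 files in total.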
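
A rough sketch of the second option, assuming you already hold all training records as an (N, 3073) uint8 array. The function name `write_batches` and its parameters are invented for this example.

```python
import os

import numpy as np


def write_batches(records, data_dir, num_batches=5):
    """Split an (N, record_length) uint8 array into files named
    data_batch_1.bin ... data_batch_<num_batches>.bin in `data_dir`."""
    records = np.asarray(records, dtype=np.uint8)
    os.makedirs(data_dir, exist_ok=True)
    paths = []
    # np.array_split copes with N not being divisible by num_batches.
    for i, chunk in enumerate(np.array_split(records, num_batches), start=1):
        path = os.path.join(data_dir, "data_batch_%d.bin" % i)
        chunk.tofile(path)
        paths.append(path)
    return paths
```

The evaluation file would be written the same way under the name test_batch.bin. Other constants in the CIFAR10 code, such as the number of examples per epoch, may also need to match your data.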

Thanks a lot! If I may ask, I'm not sure if that's somewhere in your code, but that is what I'm mainly confused about: I have images (right now jpeg) and labels (let's assume 1 hot vectors). How do I read both of those in so that they fit the format I need? Or how do I have to convert them into the required format (with the bytes). If that's in your code, I don't find it. My question is how do I convert a jpeg+label to a list of bytes which are label as well as channels? — BlackyTheCat, Jan 27 '16 at 10:29
It really depends on your data set. I don't usually upload my parsers because they aren't universal. May I know which images are you using? Also, it's easier to store the label as an integer, as that's how CIFAR10 was coded. I can write you some code and update the answer. — jkschin, Jan 27 '16 at 10:32
I am using jpegs (galaxy photos, I want to classify galaxies in the end). They can be scaled to any size, right now they are 427x427. The labels I will convert to integers most probably (I guess you mean 0 to 9 or 1 to 10, right?). — BlackyTheCat, Jan 27 '16 at 11:55
0-9 only for CIFAR10. I mean of course if you specify 20 classes, then 0-19. Refer to my previous post here http://stackoverflow.com/questions/34759227/tensorflow-cifar10-example. I had a hard time troubleshooting this careless mistake. I have posted some code in the edit, under the assumption that the image is in RGB. I hope the code helps. — jkschin, Jan 27 '16 at 12:25
Thanks! am I understanding correctly that I can run that sample code, save it in a .bin, and then basically replace the directory/file name in your cifar code with that new file name, and run it, and that should work? (For RGB images and so on). — BlackyTheCat, Jan 27 '16 at 13:27
Well, sort of. I mean if you just run the python code, you're just gonna get an image in a .bin format. Yes it would work, but you're essentially training on 1 image, so I guess it does not fit your purpose. You will have to edit the code for a use case with more images. The naming is not that simple too, refer to EDIT 2. — jkschin, Jan 27 '16 at 13:40
I understood how it would work for a single image. How do I concatenate multiple images in binary format? i.e. How do I add next image to `out` variable? — exAres, Sep 27 '16 at 07:17
The `ravel` function of numpy seems to be nicer than the list approach. Have a look to [this](http://stackoverflow.com/questions/41863336/confused-during-reshaping-array-of-image) question. — So S, Feb 28 '17 at 22:08

score 2 · Answer 2 · answered Jun 14 '16 at 21:18

None of the answers did what I wanted, so I made my own solution. It can be found on my GitHub here: https://github.com/jdeepee/machine_learning/tree/master

This script converts any number of images into training and test data whose arrays have the same shape as the cifar10 dataset.

The code is commented, so it should be easy enough to follow. I should note that it iterates through a master directory containing multiple folders, which in turn contain the images.
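
For readers who want the general idea without opening the repository, here is a minimal sketch of walking such a master directory and deriving one integer label per folder. It is not the linked script; `list_labeled_images` is a hypothetical name, and labels start at 0 here only because the CIFAR10 reader expects 0-9.

```python
import os


def list_labeled_images(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return (class_names, samples) for a layout like root/<class>/<image>.

    Classes are numbered from 0 in alphabetical order of folder name;
    `samples` is a list of (image_path, label) pairs.
    """
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    samples = []
    for label, name in enumerate(class_names):
        folder = os.path.join(root, name)
        for fname in sorted(os.listdir(folder)):
            # Skip anything that does not look like an image file.
            if fname.lower().endswith(extensions):
                samples.append((os.path.join(folder, fname), label))
    return class_names, samples
```

With folders `birds`, `cats` and `dogs` this assigns labels 0, 1 and 2 in that order; keep `class_names` around so predictions can be mapped back to names.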

answered by Joshua

What about the labels? Where are you reading them from? Could you provide a sample directory structure? – Artur Barseghyan Feb 15 '18 at 09:13
From memory I believe each directory in the input directory would correspond to an image label. So if you were classifying for images of dogs, cats and birds you would want three directories: dogs, cats and birds with the corresponding images in said directories. Classification label 1 would then indicate a "dog" classification 2 a "cat" and so on. – Joshua Feb 28 '18 at 12:47
I have already figured it out for myself, but thanks for the answer! – Artur Barseghyan Feb 28 '18 at 22:23
@Joshua I have a question. As per your code, you have image arrays and index arrays separately.How to combine them to use in the following code implementation for vqvae : https://github.com/MishaLaskin/vqvae – shome Nov 03 '22 at 18:41

score 1 · Answer 3 · answered Jun 03 '16 at 13:01

For the SVHN dataset, you can try something like this for multiple input images:

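import numpy as np
import scipy.io

# train_32x32.mat from the SVHN site: X is (32, 32, 3, N), y is (N, 1).
mat = scipy.io.loadmat('train_32x32.mat')
data = mat['X']
label = mat['y'].astype(np.uint8)
# SVHN stores the digit 0 as label 10; the CIFAR10 code expects labels 0-9.
label[label == 10] = 0
num_images = data.shape[3]  # 73257 for the SVHN training set

# Split the colour channels: each becomes (32, 32, N).
R_data = data[:, :, 0, :]
G_data = data[:, :, 1, :]
B_data = data[:, :, 2, :]

# Move the image index to the front: (N, 32, 32).
R_data = np.transpose(R_data, (2, 0, 1))
G_data = np.transpose(G_data, (2, 0, 1))
B_data = np.transpose(B_data, (2, 0, 1))

# Flatten every plane in row-major order: (N, 1024).
R_data = np.reshape(R_data, (num_images, 32 * 32))
G_data = np.reshape(G_data, (num_images, 32 * 32))
B_data = np.reshape(B_data, (num_images, 32 * 32))

# One row per image: label, then the R, G and B planes (3073 bytes).
outdata = np.concatenate((label, R_data, G_data, B_data), axis=1)
step = 10000
# Write five batches of 10000 images, starting from the first row.
# Rows beyond 50000 are not written.
for i in range(5):
    temp = outdata[i * step:(i + 1) * step, :]
    temp.tofile('SVHN_train_data_batch%d.bin' % (i + 1))
    print('save data %d' % (i + 1))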
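
Before training, it may be worth sanity-checking each file that was written. The sketch below is an assumption-laden helper, not part of any library: `check_batch_file` is a made-up name, and it assumes 32x32 RGB records and labels in the range 0 to `num_classes - 1`.

```python
import numpy as np

RECORD_BYTES = 1 + 32 * 32 * 3  # label byte + 3072 pixel bytes


def check_batch_file(path, num_classes=10):
    """Return the number of records in `path`.

    Raises ValueError if the file is not a whole number of records
    or if any label byte is outside 0..num_classes-1.
    """
    raw = np.fromfile(path, dtype=np.uint8)
    if raw.size == 0 or raw.size % RECORD_BYTES != 0:
        raise ValueError("%s: %d bytes is not a multiple of %d"
                         % (path, raw.size, RECORD_BYTES))
    # The first byte of every record is its label.
    labels = raw.reshape(-1, RECORD_BYTES)[:, 0]
    if labels.max() >= num_classes:
        raise ValueError("%s: found label %d, expected 0..%d"
                         % (path, labels.max(), num_classes - 1))
    return int(labels.size)
```

For example, `check_batch_file('SVHN_train_data_batch1.bin')` should return 10000 for a batch written by the loop above.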
