
I have a training set of ~2k greyscale images, each 300x400 px. The whole collection is ~20 MB on disk. I'm trying to classify these images with a PyBrain neural net. The problem is that when I load the dataset into a SupervisedDataSet, my small Python script consumes about 8 GB of memory, which is far too much.

So my questions are: how can I train on this dataset with a laptop that has 10 GB of RAM? Is there a way to load parts of the dataset "on demand" during training? Is there a way to split the dataset into smaller parts and feed them to the net one by one? I couldn't find the answers in the PyBrain documentation.
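The 20 MB figure is the compressed on-disk size; decoded, the raw pixels alone are far larger, and (as I understand it) SupervisedDataSet stores every input value as a double-precision float, multiplying that again. A rough estimate, assuming 8 bytes per stored value:

```python
# Rough estimate of memory growth from compressed files to a float dataset.
# Assumption: the dataset stores each input value as a float64 (8 bytes).
n_images = 2000
pixels = 300 * 400          # 120000 greyscale bytes per decoded image

decoded = n_images * pixels  # raw pixel bytes once decompressed
as_floats = decoded * 8      # same pixels stored as 8-byte floats

print(decoded)    # 240000000  -> ~240 MB of raw pixels
print(as_floats)  # 1920000000 -> ~1.9 GB just for the input arrays
```

That is already ~1.9 GB before Python object overhead and the intermediate `bytearray` copies made while building the list, which can plausibly push the peak toward the 8 GB you observed.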

Here is how I build the dataset:

import os
from PIL import Image
from pybrain.datasets import SupervisedDataSet

# returns [(image bytes, category)] where category = 1 for apple, 0 for banana
def load_images(path):
    data = []
    for d, dirs, files in os.walk(path):
        for f in files:
            category = int(f.startswith('apple_'))
            im = Image.open(os.path.join(d, f))
            data.append((bytearray(im.tobytes()), category))
    return data


def load_data_set(path):
    print 'loading images'
    data = load_images(path)

    print 'creating dataset'
    ds = SupervisedDataSet(120000, 1)  # 300*400 = 120000 input bytes per image
    for inputs, category in data:
        ds.addSample(inputs, (category,))
    return ds
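As far as I know PyBrain has no built-in lazy loader, but the dataset can be built in chunks so only one chunk of samples is in memory at a time. A minimal chunking helper in pure Python; the PyBrain calls in the trailing comment are assumptions based on the names used in this question, not verified against the library:

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical PyBrain usage (names taken from the question, unverified):
# for batch in chunks(sample_iter, 200):
#     ds = SupervisedDataSet(120000, 1)
#     for inputs, category in batch:
#         ds.addSample(inputs, (category,))
#     trainer.setData(ds)   # assumption: the trainer accepts a new dataset
#     trainer.train()       # one epoch over this chunk only
```

With `sample_iter` being a generator that opens and decodes one image at a time, peak memory is bounded by the chunk size instead of the whole collection.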

Thank you for any kind of help.

milo
    *Don't load it all into memory!*; Do the training iteratively in smaller chunks. – James Mills May 24 '15 at 09:24
  • @JamesMills good point! I've been thinking about it. Actually I'm a newbie with pybrain and I'm lost between the two functions `train()` and `trainUntilConvergence()`. How do I use them properly with a split dataset? If I call `train()` n times on different parts of the dataset, would it achieve the same result as calling `trainUntilConvergence()` once on the whole dataset? – milo May 24 '15 at 09:30
  • To be quite honest; I haven't a clue on that particular library either :) Sorry! But in general don't load 8GB worth of data into working memory if your system can't handle it :) – James Mills May 24 '15 at 09:42
  • @JamesMills my set of images has a size of 20 MB, and I didn't expect it to take 8 GB in the dataset structure... That's why I came here with my question :) – milo May 24 '15 at 09:51
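On the `train()` vs `trainUntilConvergence()` question raised in the comments: `train()` performs a single epoch, while `trainUntilConvergence()` repeats epochs until a validation error stops improving, so calling `train()` once per chunk gives you one pass over the data and you must recreate the outer loop yourself. A library-free sketch of that stopping logic, where `train_one_epoch` and `validation_error` are hypothetical stand-ins for your own chunked training pass and error measurement:

```python
def train_until_convergence(train_one_epoch, validation_error,
                            max_epochs=100, patience=3):
    """Run epochs until the validation error fails to improve
    `patience` times in a row; return the best error seen."""
    best, bad_streak = float('inf'), 0
    for epoch in range(max_epochs):
        train_one_epoch()        # e.g. one pass of train() over all chunks
        err = validation_error()
        if err < best - 1e-9:
            best, bad_streak = err, 0
        else:
            bad_streak += 1
            if bad_streak >= patience:
                break
    return best
```

So n calls to `train()` on chunks roughly correspond to one epoch over the whole dataset, not to `trainUntilConvergence()`; whether the result matches training on the full set in one dataset depends on chunk ordering and size.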

0 Answers