How to predict on new data using Pybrain?

Question

What I want to do is ask Pybrain to predict on new data: for example, calling predict(0,1,0,1,1,0) should output the answer the network thinks is correct.

The question is, what code do I need to paste to make this happen?

Additional info: the weather.csv file that Pybrain is learning on has 6 attributes and the answer can only be 1 or 0. No other number.

Again, all I want to do is ask Pybrain, after it has learned, to predict on numbers I give it, for example predict(0,1,0,1,1,0), and it should output an answer. I am very new to Python and Pybrain.

This is my code so far:

from pybrain.datasets import SupervisedDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer

ds = SupervisedDataSet(6,1)

# Read each CSV row: the first 6 values are inputs, the rest is the target.
with open('weather.csv', 'r') as tf:
    for line in tf:
        try:
            data = [float(x) for x in line.strip().split(',') if x != '']
            indata = tuple(data[:6])
            outdata = tuple(data[6:])
            ds.addSample(indata, outdata)
        except ValueError as e:
            print "error", e, "on line", repr(line)


n = buildNetwork(ds.indim,8,8,ds.outdim,recurrent=True)
t = BackpropTrainer(n,learningrate=0.001,momentum=0.05,verbose=True)
t.trainOnDataset(ds,3000)
t.testOnData(verbose=True)

Update:

My weather.csv file has a total of only 7 observations (just for testing purposes for now). It looks like this inside the csv file (the data was extracted from one week in 1970):

1   0   1   1   1   1   1
0   0   0   1   1   1   0
1   0   1   1   1   1   1
0   0   0   1   1   1   0
0   0   0   1   1   1   0
0   0   0   1   1   1   0
0   0   0   1   1   1   0

The last column (far right) is the one Pybrain needs to predict. When I run the code and tell Pybrain to train on this little data set 3000 times (I want to overfit), the output I get is:

Total error: 0.0140074590407
Total error: 0.0139930126505
Total error: 0.0139796724323
Total error: 0.0139656881439

Testing on data:
out:     [  0.732]
correct: [  1.000]
error:  0.03581333
out:     [  0.101]
correct: [  0.000]
error:  0.00511758
out:     [  0.732]
correct: [  1.000]
error:  0.03581333
out:     [  0.101]
correct: [  0.000]
error:  0.00511758
out:     [  0.101]
correct: [  0.000]
error:  0.00511758
out:     [  0.101]
correct: [  0.000]
error:  0.00511758
out:     [  0.101]
correct: [  0.000]
error:  0.00511758

Now I just want to tell Pybrain to use the overfitted model it has trained to predict on new data from 2014, but I don't know how. My goal is to see how well the overfitted model does on that new data.

So, what happens when you run this code? How is it different from what you expect? Is the last line not giving you predictions? – rossdavidh, Sep 28 '14 at 15:05
The last line gives me a prediction (if I leave the last column empty), but it has still trained on that data, and that's what I am trying to avoid. – ben olsen, Sep 29 '14 at 00:24

score 5 · Accepted Answer · edited May 31 '18 at 21:00

If I understand your question correctly, you want to use the activate function. For example, if you add these two lines to the end of your code above:

data2014 = n.activate([0,1,0,1,0,1])
print 'data2014',data2014

...it will print out the output for a single row. Of course, you probably want to predict for more than a single row, so you will want to read in a second csv, use the activate function in a loop, etc. But this should give you the basic idea.
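As a minimal sketch of that loop (not part of the original answer): the helper name predict_rows, the file name weather2014.csv, and the 0.5 cut-off for turning the raw output into 0 or 1 are all assumptions made for illustration. The only PyBrain call it relies on is the activate method shown above, and it expects a comma-separated file whose first 6 columns are the inputs.

```python
import csv


def predict_rows(net, path, n_inputs=6, threshold=0.5):
    """Run net.activate on every row of a CSV file.

    `net` is any object with an activate(inputs) method, such as the
    trained network `n` from the question. Returns a list of
    (raw_output, predicted_label) tuples, one per usable row.
    The 0.5 threshold is an assumed cut-off, not something PyBrain sets.
    """
    predictions = []
    with open(path) as f:
        for row in csv.reader(f):
            try:
                values = [float(x) for x in row if x.strip() != '']
            except ValueError:
                continue  # skip rows that contain non-numeric text
            if len(values) < n_inputs:
                continue  # skip blank or short rows
            # Only the first n_inputs columns are fed to the network;
            # a trailing answer column, if present, is ignored.
            output = net.activate(values[:n_inputs])
            score = float(output[0])
            predictions.append((score, 1 if score >= threshold else 0))
    return predictions


# Hypothetical usage with the trained network `n` from the question:
# for score, label in predict_rows(n, 'weather2014.csv'):
#     print("raw output %.3f -> predicted %d" % (score, label))
```

Because the new rows are only passed to activate and never added to ds, the network does not train on them.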

answered Sep 30 '14 at 17:52 by rossdavidh · edited May 31 '18 at 21:00 by halfer

I have one quick question. When I load the full weather.csv file with 4000 data points, it's SUPER SLOW. Would you know how I could speed it up? Right now I have 8 GB of RAM and an i7 processor. – ben olsen Oct 06 '14 at 08:29
How slow do you mean? – rossdavidh Oct 06 '14 at 13:11
When the weather.csv file only had 7 data points and I trained on it 3000 times, it completed within 5 minutes. Now the new weather.csv file has 4000 data points and it takes 7 hours. – ben olsen Oct 06 '14 at 18:57
Well, if it takes 5 minutes to do 7 points, then that might be real (i.e. not a result of running out of memory or whatnot). Do you need to train on the entire dataset? Try training on a randomly selected subset and see if the predictions are meaningfully different. It may be that after 50 or 100 or 500 data points you already arrive at your more or less final answer. Also, you could use trainUntilConvergence instead of trainOnDataset; it may be that you don't need 3000 iterations to converge on your final answer. – rossdavidh Oct 07 '14 at 13:30
I tried trainUntilConvergence; the results (output) would always be around 50%. What I am thinking of is using an Amazon AWS EC2 instance. Do you know anything about Amazon AWS? If yes, can you recommend one that is fast? – ben olsen Oct 12 '14 at 01:20
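
The random-subset suggestion from the comments above could be sketched roughly as follows. The helper name sample_rows and the subset size of 500 are assumptions for illustration; the commented usage assumes the rows have already been parsed into lists of 7 floats, as in the question's loading loop, and has not been run against the asker's data.

```python
import random


def sample_rows(rows, k, seed=None):
    """Return k rows chosen at random without replacement.

    If k is at least the number of rows, all rows are returned.
    Passing a seed makes the selection repeatable.
    """
    rows = list(rows)
    if k >= len(rows):
        return rows
    return random.Random(seed).sample(rows, k)


# Hypothetical usage with the names from the question (untested sketch):
# subset = sample_rows(all_rows, 500, seed=0)
# small_ds = SupervisedDataSet(6, 1)
# for row in subset:
#     small_ds.addSample(tuple(row[:6]), tuple(row[6:]))
# t = BackpropTrainer(n, small_ds, learningrate=0.001, momentum=0.05)
# t.trainUntilConvergence(maxEpochs=200)
```

Training on subsets of increasing size and comparing the predictions gives a rough idea of how many rows are actually needed.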
