I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. After I train it and give it more data, it predicts either all 1's or all 0's, and never a mix of the two. I'm wondering if one of you could tell me what I'm doing wrong.
I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't understand the answers that I've come across.
So I start off with this, and it more or less works:
from pyspark.mllib.regression import LabeledPoint
cooked_rdd = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [1])])
from pyspark.mllib.classification import SVMWithSGD
model = SVMWithSGD.train(cooked_rdd)
I say "more or less" because
model.predict([0])
Out[47]: 0
is what I would expect, and...
model.predict([1])
Out[48]: 1
is also what I would expect, but...
model.predict([0.000001])
Out[49]: 1
is definitely not what I expected. I think that whatever is causing that is at the root of my problems.
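To make that last result concrete, here is a minimal plain-Python sketch of the behaviour I'm seeing. It assumes the trained model boils down to a linear rule of the form weight * x + intercept compared against a threshold of 0, and that the intercept is 0. Both are assumptions on my part, and the `linear_predict` helper and the weight values are made up for illustration, not read from the actual model.

```python
def linear_predict(x, weight, intercept=0.0, threshold=0.0):
    """Return 1 if weight * x + intercept is above the threshold, else 0."""
    margin = weight * x + intercept
    return 1 if margin > threshold else 0

# Hypothetical positive weight; the real fitted value is unknown.
w = 1.7
inputs = (0, 0.000001, 1)

# No intercept: the boundary sits exactly at x = 0.
no_intercept = [linear_predict(x, w) for x in inputs]

# Hypothetical intercept that moves the boundary to x = 0.5.
with_intercept = [linear_predict(x, w, intercept=-0.85) for x in inputs]

print(no_intercept)    # [0, 1, 1]
print(with_intercept)  # [0, 0, 1]
```

If that assumption holds, every positive input lands on the 1 side no matter how small it is, which would match Out[49].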
Here I start by cooking my data...
from random import random
def cook_data():
    # Random point in the unit square, labelled by a noisy circular boundary.
    x = random()
    y = random()
    dice = 0.25 + (random() * 0.5)
    if x**2 + y**2 > dice:
        category = 0
    else:
        category = 1
    return LabeledPoint(category, [x, y])
cooked_data = []
for i in range(0,5000):
    cooked_data.append(cook_data())
... and I get a beautiful cloud of points. When I plot them I get a division with a little bit of a muddled area, but any kindergartner could draw a line to separate them. So why is it that when I try drawing a line to separate them...
cooked_rdd = sc.parallelize(cooked_data)
training, testing = cooked_rdd.randomSplit([0.9, 0.1], seed = 1)
model = SVMWithSGD.train(training)
prediction_and_label = testing.map(lambda p : (model.predict(p.features), p.label))
...I can only lump them into one group, and not two? (Below is a list that shows tuples of what the SVM predicted, and what the answer should have been.)
prediction_and_label.collect()
Out[54]:
[(0, 1.0),
(0, 0.0),
(0, 0.0),
(0, 1.0),
(0, 0.0),
(0, 0.0),
(0, 1.0),
(0, 0.0),
(0, 1.0),
(0, 1.0),
...
And so on. It only ever guesses 0, when there should be a pretty obvious division where it should start guessing 1. Can anyone tell me what I'm doing wrong? Thanks for your help.
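To put a number on it, a small helper along these lines counts each (prediction, label) combination. It is only a sketch: the `tally` name is mine, it runs on the collected Python list and not on the RDD, and the sample is just the first ten pairs shown above.

```python
from collections import Counter

def tally(pairs):
    """Count how often each (prediction, label) combination occurs."""
    return Counter((int(pred), int(label)) for pred, label in pairs)

# The first ten pairs from the output above.
sample = [(0, 1.0), (0, 0.0), (0, 0.0), (0, 1.0), (0, 0.0),
          (0, 0.0), (0, 1.0), (0, 0.0), (0, 1.0), (0, 1.0)]

counts = tally(sample)
correct = sum(n for (pred, label), n in counts.items() if pred == label)
accuracy = correct / len(sample)

print(counts)
print(accuracy)
```

On those ten pairs the model is right only when the true label happens to be 0, and the count for a predicted 1 is zero.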
Edit: I don't think it's a problem with scale, as was suggested in some other posts with similar problems. I've tried multiplying everything by 100, and I still get the same problem. I've also tried playing with how I calculate my "dice" variable, but all that does is change the SVM's guesses from all 0's to all 1's.
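In case it helps anyone reproduce this, here is a rough way to check how the "dice" range shifts the balance of labels without going through Spark. The `label_share` helper, the seed, and the second range are made up for illustration; the labelling rule is the same one cook_data uses.

```python
from random import Random

def label_share(dice_low, dice_width, n=5000, seed=1):
    """Fraction of points labelled 1 when dice = dice_low + random() * dice_width."""
    rng = Random(seed)
    ones = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        dice = dice_low + rng.random() * dice_width
        if x**2 + y**2 <= dice:  # same rule as cook_data: category 1
            ones += 1
    return ones / n

original_share = label_share(0.25, 0.5)  # the dice range used above
shifted_share = label_share(0.75, 0.5)   # a hypothetical alternative range

print(original_share, shifted_share)
```

Raising the dice range labels more of the points as 1, so the majority class changes even though the shape of the boundary stays the same.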