Why is my Spark SVM always predicting the same label?

Question

I'm having trouble getting my SVM to predict 0's and 1's where I would expect it to. After I train it and give it more data, it predicts either all 1's or all 0's, never a mix of the two. Could someone tell me what I'm doing wrong?

I've searched for "svm always predicting same value" and similar problems, and it looks like this is pretty common for those of us new to machine learning. I'm afraid though that I don't understand the answers that I've come across.

So I start off with this, and it more or less works:

from pyspark.mllib.regression import LabeledPoint
cooked_rdd = sc.parallelize([LabeledPoint(0, [0]), LabeledPoint(1, [1])])
from pyspark.mllib.classification import SVMWithSGD
model = SVMWithSGD.train(cooked_rdd)

I say "more or less" because

model.predict([0])
Out[47]: 0

is what I would expect, and...

model.predict([1])
Out[48]: 1

is also what I would expect, but...

model.predict([0.000001])
Out[49]: 1

is definitely not what I expected. I think that whatever is causing that is at the root of my problems.

Here I start by cooking my data...

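from random import random

def cook_data():
  # Random point in the unit square.
  x = random()
  y = random()
  # Noisy cutoff between 0.25 and 0.75 blurs the class boundary.
  dice = 0.25 + (random() * 0.5)
  # Points outside the (noisy) circle are class 0; points inside are class 1.
  if x**2 + y**2 > dice:
    category = 0
  else:
    category = 1
  return LabeledPoint(category, [x, y])

cooked_data = [cook_data() for _ in range(5000)]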

... and I get a beautiful cloud of points. When I plot them I get a division with a little bit of a muddled area, but any kindergartner could draw a line to separate them. So why is it that when I try drawing a line to separate them...

cooked_rdd = sc.parallelize(cooked_data)
training, testing = cooked_rdd.randomSplit([0.9, 0.1], seed = 1)
model = SVMWithSGD.train(training)
prediction_and_label = testing.map(lambda p : (model.predict(p.features), p.label))

...I can only lump them into one group, and not two? (Below is a list that shows tuples of what the SVM predicted, and what the answer should have been.)

prediction_and_label.collect()
Out[54]: 
[(0, 1.0),
 (0, 0.0),
 (0, 0.0),
 (0, 1.0),
 (0, 0.0),
 (0, 0.0),
 (0, 1.0),
 (0, 0.0),
 (0, 1.0),
 (0, 1.0),
...

And so on. It only ever guesses 0, when there should be a pretty obvious division where it should start guessing 1. Can anyone tell me what I'm doing wrong? Thanks for your help.
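A quick tally makes the pattern easier to see than scrolling through the list. This is a minimal plain-Python sketch over the first ten pairs shown above, copied into a local list; the variable names are just for illustration, and on the real RDD the same counting would run on the result of `collect()`.

```python
from collections import Counter

# First ten (prediction, label) pairs from the output above.
pairs = [(0, 1.0), (0, 0.0), (0, 0.0), (0, 1.0), (0, 0.0),
         (0, 0.0), (0, 1.0), (0, 0.0), (0, 1.0), (0, 1.0)]

# How often each class was predicted vs. how often it actually occurs.
predicted_counts = Counter(pred for pred, _ in pairs)
actual_counts = Counter(int(label) for _, label in pairs)

# Fraction of pairs where the prediction matches the label.
accuracy = sum(1 for pred, label in pairs if pred == label) / len(pairs)

print(predicted_counts, actual_counts, accuracy)
```

The labels are mixed, but the predictions all fall in one class.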

Edit: I don't think it's a problem with scale, as was suggested in some other posts with similar problems. I've tried multiplying everything by 100, and I still get the same problem. I also tried playing with how I calculate my "dice" variable, but all I can do is change the SVM's guesses from all 0's to all 1's.

score 5 · Accepted Answer · answered Nov 02 '15 at 13:54

I figured out why it's always predicting either all 1's or all 0's. I need to add this line:

model.setThreshold(0.5)

That fixes it. I figured it out after using

model.clearThreshold()

clearThreshold, followed by predicting on the test data, showed me the raw floating-point score the model computes, and not just the binary 0 or 1 I'm ultimately looking for. I could see that the SVM was applying a cutoff to that score that I considered counterintuitive. By using setThreshold, I'm now able to get much better results.
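To illustrate what the threshold does, here is a plain-Python sketch of a linear decision rule: compute a signed score, then compare it with a cutoff. The weights, intercept, cutoff values, and the `raw_margin` and `predict` helpers are made up for illustration; they are not Spark's code and not values from my trained model. The sketch assumes a model with no intercept and features that are all positive, in which case every score can land on the same side of a cutoff of 0.

```python
# Plain-Python sketch of a linear classifier's decision rule (no Spark needed).
# Weights, intercept, and cutoffs are illustrative assumptions only.

def raw_margin(weights, intercept, features):
    """Signed score w . x + b, the kind of value seen after clearThreshold()."""
    return sum(w * x for w, x in zip(weights, features)) + intercept

def predict(weights, intercept, features, threshold=None):
    """Return the raw score when threshold is None, else a 0/1 label."""
    margin = raw_margin(weights, intercept, features)
    if threshold is None:
        return margin
    return 1 if margin > threshold else 0

weights, intercept = [-1.2, -0.9], 0.0           # hypothetical model, no intercept
points = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]    # all-positive features

scores = [predict(weights, intercept, p) for p in points]
labels_at_zero = [predict(weights, intercept, p, threshold=0.0) for p in points]
labels_shifted = [predict(weights, intercept, p, threshold=-1.0) for p in points]

print(scores)          # every score is negative
print(labels_at_zero)  # so a cutoff of 0 gives a single class
print(labels_shifted)  # a different cutoff splits the points
```

Which cutoff works for a real model depends on the scores you actually see, so it is worth looking at them before choosing one.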

score 0 · Answer 2 · answered Oct 29 '15 at 21:34

SVMs are generally a very tuning-dependent model, and if you have a poor choice of parameters, you can get this degenerate behavior. I'd recommend starting with a more straightforward classification model type like logistic regression or decision trees/random forest, and get that working first to ensure you've got the surrounding code right.

Once that's set, if you still want to go deeper with SVMs, you can use cross-validated grid search to find better parameters for the model and dataset. The details of how to do that are more than fits in a single Stack Overflow answer, but there's plenty of good reading on the web about it.
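As a rough idea of the shape of it, here is a minimal plain-Python sketch of cross-validated grid search. Everything in it is hypothetical scaffolding: `k_fold_indices`, `cross_val_accuracy`, `grid_search`, and the toy `train_cutoff` trainer are names invented for this example, not part of Spark. With a real model you would swap the toy trainer for a wrapper around your training call and put its parameters (for example step size or regularization strength) in the grid.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(data, k, train_fn, params):
    """Average held-out accuracy of train_fn(train_rows, **params) over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train_rows = [row for i, row in enumerate(data) if i not in held]
        test_rows = [data[i] for i in held_out]
        predict = train_fn(train_rows, **params)
        correct = sum(1 for x, y in test_rows if predict(x) == y)
        accuracies.append(correct / len(test_rows))
    return sum(accuracies) / k

def grid_search(data, k, train_fn, grid):
    """Return (best_params, best_score) over a list of parameter dicts."""
    scored = [(cross_val_accuracy(data, k, train_fn, p), p) for p in grid]
    best_score, best_params = max(scored, key=lambda t: t[0])
    return best_params, best_score

# Toy stand-in for a real trainer: a fixed 1-D cutoff that ignores the training rows.
def train_cutoff(train_rows, cutoff):
    return lambda x: 1 if x < cutoff else 0

# Toy data: label 1 below 0.5, label 0 otherwise.
data = [(i / 20, 1 if i / 20 < 0.5 else 0) for i in range(20)]
grid = [{"cutoff": c} for c in (0.1, 0.5, 0.9)]

best_params, best_score = grid_search(data, k=4, train_fn=train_cutoff, grid=grid)
print(best_params, best_score)
```

The point of scoring on held-out folds is that a parameter setting which collapses to a single class shows up as a poor score instead of going unnoticed.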

Actually I'm looking for good references on Cross Validation and Model Selection for SVM on SparkMLlib, in particular SVMWithSGD. Can't find much, do you have a pointer? — Alessandro S., Nov 25 '16 at 13:39
