I have instantiated an SVC object using the sklearn library with the following code:

clf = svm.SVC(kernel='linear', C=1, cache_size=1000, max_iter = -1, verbose = True)

I then fit data to it using:

model = clf.fit(X_train, y_train)

where X_train is a (301, 60) ndarray and y_train is a (301,) ndarray (y_train consists of the class labels "1", "2" and "3").

Before I stumbled across the .score() method, I was using the following to determine the accuracy of my model on the training set:

prediction = np.divide((y_train == model.predict(X_train)).sum(), y_train.size, dtype = float)

which gives a result of approximately 62%.

However, when using the model.score(X_train, y_train) method I get a result of approximately 83%.

Can anyone explain why this is the case? As far as I understand, the two should return the same result.

ADDENDUM:

The first 10 values of y_true are:

  • 2, 3, 1, 3, 2, 3, 2, 2, 3, 1, ...

Whereas for y_pred (when using model.predict(X_train)), they are:

  • 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, ...
precicely
  • That's weird, can you post some subset of your data (at least some `y_true` and `y_pred` values)? – elyase Jan 22 '15 at 23:32

1 Answer

Because your y_train is (301, 1) and not (301,), numpy does broadcasting, so

(y_train == model.predict(X_train)).shape == (301, 301)

which is not what you intended. The correct version of your code would be

np.mean(y_train.ravel() == model.predict(X_train))

which will give the same result as

model.score(X_train, y_train)
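
To make the broadcasting effect concrete, here is a minimal sketch using only the first ten labels and predictions quoted in the question's addendum. The variable names (`y_col`, `broadcast`, `elementwise`) are made up for illustration, and the column-shaped `y_col` is an assumption standing in for a (301, 1) `y_train`:

```python
import numpy as np

# First ten values from the question's addendum
y_true = np.array([2, 3, 1, 3, 2, 3, 2, 2, 3, 1])
y_pred = np.array([2, 3, 3, 2, 2, 3, 2, 3, 3, 3])

# Assumption for illustration: labels stored as a column vector, shape (10, 1)
y_col = y_true.reshape(-1, 1)

# Column vs. flat vector: numpy broadcasts and compares every label
# against every prediction, giving a 2-D boolean matrix
broadcast = (y_col == y_pred)

# Flattening first gives the intended element-by-element comparison
elementwise = (y_col.ravel() == y_pred)

print(broadcast.shape)     # 2-D result from broadcasting
print(elementwise.shape)   # 1-D result, one entry per sample
print(elementwise.mean())  # fraction of matching labels
```

Checking `.shape` on the comparison result is a quick way to tell which of the two cases you are in.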
Andreas Mueller
  • Unfortunately, I was incorrect when stating the question, y_train is in fact a (301,) - my mistake (question has been edited)! – precicely Jan 23 '15 at 10:30
  • That being said, when using `np.mean(y_train.ravel() == model.predict(X_train))` I still get a training accuracy of 60ish percent. :( – precicely Jan 23 '15 at 10:37
  • What is the shape and dtype of `y_train`, `X_train`, `model.predict(X_train)` and `y_train == model.predict(X_train)`? – Andreas Mueller Jan 23 '15 at 19:17
  • `y_train` : (301,) int64; `X_train` : (301, 60) float64; `model.predict(X_train)` : (301,) int64; `y_train == model.predict(X_train)` : bool; Does that help? – precicely Jan 26 '15 at 11:10
  • Is ``y_train == model.predict(X_train)`` just a bool or an array of dtype bool? If so, what is the shape? This is really odd. Can you share the data? – Andreas Mueller Jan 26 '15 at 23:00
  • `y_train == model.predict(X_train)` is a (301,) of dtype bool. I'll see what I can do about providing the data - unfortunately it's a non-trivial process... – precicely Jan 28 '15 at 10:18
  • Turns out, due to a nuance in the way I was handling my data set, X_train was slightly modified between the two function calls, hence the discrepancy in the accuracy results. Thank you for your help and I apologise for sending you on a wild goose chase. Cheers! – precicely Jan 28 '15 at 10:47
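
For anyone hitting a similar mismatch, one cheap check is to snapshot the input array before the first call and compare it afterwards. The sketch below is only an illustration: the small array, the `snapshot` name, and the simulated in-place edit are all made up, since the thread does not say how X_train actually got modified.

```python
import numpy as np

# Hypothetical stand-in for X_train (the real data was never posted)
X_train = np.arange(12, dtype=float).reshape(4, 3)

# Take an independent copy before computing the first accuracy figure
snapshot = X_train.copy()

# Simulated accidental in-place modification between the two calls
X_train[0, 0] += 1.0

# If this is False, the two accuracy figures were computed on different inputs
unchanged = np.array_equal(snapshot, X_train)
print(unchanged)
```

Calling `model.predict(X_train)` once, storing the result, and reusing it for every metric avoids the problem altogether.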