
I am working on a data mining project and am using the sklearn package in Python for classifying my data.

In order to train on my data and evaluate the quality of the predicted values, I am using the sklearn.cross_validation.cross_val_predict function.

However, when I try to run my model on the test data, it asks for the base class, which is not available.

I have seen (possible) work-arounds using the sklearn.grid_search.GridSearchCV function, but I am loath to use such a method for a fixed set of parameters.

Going through the sklearn.cross_validation documentation, I have come across the cross_val_score function. Since I am fairly new to the world of classification problems, I am not quite sure if this is the function that would solve my problem.

Any help will be awesome!

Thanks!

edit:

Hello! I get the impression I was fairly vague with my original query, so I'll try to detail exactly what I am doing. Here goes:

I have generated three numpy.ndarrays X, X_test and y with 10158, 22513 and 10158 rows respectively, which correspond to my training data, test data and class labels for the training data.

Thereafter, I run the following code:

    from sklearn.svm import SVC
    from sklearn.cross_validation import cross_val_predict
    clf = SVC()
    testPred = cross_val_predict(clf,X,y,cv=2)

This works fine and I can then use testPred and y as mentioned in the tutorials.

However, what I am really after is predicting the classes of X_test. When I try that, the error message is rather self-explanatory and says:

    ValueError: Found arrays with inconsistent numbers of samples: [10158 22513]
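
The call producing this error is essentially the following (the test features get passed together with my training labels, which have a different number of rows):

    testPred = cross_val_predict(clf, X_test, y, cv=2)  # X_test has 22513 rows, y has 10158 labels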

The current work-around I am using (I do not know whether it is a work-around or the only way to do it) is:

    from sklearn import grid_search
    # thereafter I create the parameter grid (grid) and appropriate scoring function (scorer)
    model = grid_search.GridSearchCV(estimator = clf, param_grid = grid, scoring = scorer, refit = True, cv = 2, n_jobs = -1)
    model.fit(X,y)
    model.best_estimator_.fit(X,y)
    testPred = model.best_estimator_.predict(X_test)

This technique works fine for the time being; however, if I didn't have to use the GridSearchCV function, I'd be able to sleep much better.

  • What do you mean by 'base class not available'? Can you post the error message you see? How many classes do you have in your y labels? I can write some sample code to demonstrate how to use sklearn to train and test a classification problem. – Jianxun Li Jun 27 '15 at 14:46
  • @JianxunLi Hi! Thanks for the offer! I have edited my original post to make my question more clear. Also, I am working with 4 classes. Thanks again! – AnirudhJ Jun 27 '15 at 19:53
  • Ah, much better! IIUC, you're mixing up a bit of stuff. Will update my answer. – Ami Tavory Jun 27 '15 at 20:00

1 Answer

IIUC, you're conflating different things.

Suppose you have a classifier with a given scheme. Then you can train it on some data, and predict labels for (usually other) data. This is quite simple, and looks like this.

First we build the predictor and fit it.

    >>> from sklearn import svm, grid_search, datasets
    >>> from sklearn.cross_validation import train_test_split
    >>> iris = datasets.load_iris()
    >>> clf = svm.SVC()
    >>> train_x, test_x, train_y, test_y = train_test_split(iris.data, iris.target)
    >>> clf.fit(train_x, train_y)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
        kernel='rbf', max_iter=-1, probability=False, random_state=None,
        shrinking=True, tol=0.001, verbose=False)

Now that it is completely constructed, you can use it to predict.

    >>> clf.predict(test_x)
    array([1, 0, 0, 2, 0, 1, 1, 1, 0, 2, 2, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 2, 0,
           1, 0, 2, 0, 2, 1, 2, 1, 2, 2, 2, 1, 0, 0, 0])

It's as simple as that.

What has happened here?

  • The classifier has a completely specified scheme - it just needs to tune its parameters

  • The classifier tunes its parameters given the train data

  • The classifier is ready to predict


In many cases, the classifier has a scheme whose parameters it tunes from the training data, but it also has meta-parameters that fitting does not tune. An example is the degree argument of your classifier.
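
A meta-parameter is something you fix when you construct the classifier, rather than something fit() learns from the data. A small sketch (the polynomial-kernel classifier here is only illustrative):

    from sklearn import svm

    # degree is a meta-parameter: it is chosen up front (or by a search), not learned by fit()
    clf_poly = svm.SVC(kernel='poly', degree=2)

    # the ordinary parameters (support vectors, dual coefficients, ...) are only set later,
    # when fit() is called on training data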

How should you tune them? There are a number of ways.

  • Don't: just stick with the defaults (that's what my example above did)

  • Use some form of cross-validation (e.g., grid search; see the sketch below the list)

  • Use some measure of complexity, e.g., AIC, BIC, etc.
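
To make the cross-validation option concrete, here is a minimal sketch of grid search on the same iris data as above (the parameter grid values are just an illustrative choice, not a recommendation):

    from sklearn import svm, grid_search, datasets
    from sklearn.cross_validation import train_test_split

    iris = datasets.load_iris()
    train_x, test_x, train_y, test_y = train_test_split(iris.data, iris.target)

    # candidate values for the meta-parameters C and gamma (illustrative only)
    param_grid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1]}

    # cross-validation on the training data picks the best combination;
    # refit=True (the default) then refits on all of the training data
    model = grid_search.GridSearchCV(svm.SVC(), param_grid, cv=5)
    model.fit(train_x, train_y)
    predictions = model.predict(test_x)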


So it's important not to mix these things up. Cross-validation is not some trick to get a predictor for the test data; the predictor with the default arguments can already do that. Cross-validation is for tuning the meta-parameters. Once you have chosen them, you tune the parameters, and then you have a different predictor.
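
Concretely, for the situation in your edit: with a fixed set of parameters there is no need for GridSearchCV at all. A sketch using the X, y and X_test arrays from your edit (the cross_val_score call is optional; it only estimates how well the fixed settings generalise):

    from sklearn.svm import SVC
    from sklearn.cross_validation import cross_val_score

    clf = SVC()

    # optional sanity check: cross-validated accuracy on the training data only
    scores = cross_val_score(clf, X, y, cv=2)

    # fit on the full training set, then predict the unlabeled test set
    clf.fit(X, y)
    testPred = clf.predict(X_test)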

Ami Tavory