I am working on a data mining project and am using the sklearn package in Python to classify my data.
To train my model and evaluate the quality of the predicted values, I am using the sklearn.cross_validation.cross_val_predict function.
However, when I try to run my model on the test data, it asks for the base classes, which are not available for the test data.
I have seen (possible) workarounds using the sklearn.grid_search.GridSearchCV function, but I am loath to use such a method for a fixed set of parameters.
Going through the sklearn.cross_validation documentation, I have come across the cross_val_score function. Since I am fairly new to the world of classification problems, I am not quite sure whether this is the function that would solve my problem.
Any help will be awesome!
Thanks!
Edit:
Hello! I get the impression I was fairly vague with my original query. I'll try to detail exactly what I am doing. Here goes:
I have generated three numpy.ndarrays, X, X_test and y, with 10158, 22513 and 10158 rows respectively, which correspond to my training data, my test data and the class labels for the training data.
I then run the following code:
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_predict
clf = SVC()
testPred = cross_val_predict(clf, X, y, cv=2)
This works fine and I can then use testPred and y as mentioned in the tutorials.
However, I am looking to predict the classes of X_test. When I try to do so with cross_val_predict, the error message is rather self-explanatory:
ValueError: Found arrays with inconsistent numbers of samples: [10158 22513]
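One possible reading of this error, assuming the failing call paired X_test with y, is that cross_val_predict needs one label per row of the data it is given, and y only holds labels for the 10158 training rows. The sketch below reproduces that kind of length check with sklearn.utils.check_consistent_length on placeholder arrays. The feature count of 3 is made up for illustration, and the exact wording of the message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Placeholder arrays with the row counts from the question;
# the 3 feature columns are an arbitrary, hypothetical choice.
X_test = np.zeros((22513, 3))
y = np.zeros(10158)

try:
    # Raises ValueError when the arrays differ in their number of samples
    check_consistent_length(X_test, y)
    lengths_match = True
except ValueError as err:
    lengths_match = False
    print(err)
```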
The current workaround (I do not know whether it is a workaround or the only way to do it) that I am using is:
from sklearn import grid_search
# thereafter I create the parameter grid (grid) and appropriate scoring function (scorer)
model = grid_search.GridSearchCV(estimator=clf, param_grid=grid, scoring=scorer, refit=True, cv=2, n_jobs=-1)
model.fit(X, y)
# refit=True has already refit best_estimator_ on all of X and y, so this second fit is redundant
model.best_estimator_.fit(X, y)
testPred = model.best_estimator_.predict(X_test)
This technique works fine for the time being; however, if I didn't have to use the GridSearchCV function I'd be able to sleep much better.
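For comparison, here is a minimal sketch of what the same workflow might look like without GridSearchCV, assuming the parameters really are fixed. It uses small synthetic stand-ins for X, y and X_test (hypothetical data, not my real arrays), and the try/except import is only there because newer scikit-learn releases expose cross_val_predict from sklearn.model_selection. The idea is that cross_val_predict fits clones of the estimator to produce out-of-fold predictions for evaluation, so the classifier itself still has to be fit once on the full training set before predict can be called on X_test.

```python
import numpy as np
from sklearn.svm import SVC

try:  # newer scikit-learn releases
    from sklearn.model_selection import cross_val_predict
except ImportError:  # older releases, as used in this question
    from sklearn.cross_validation import cross_val_predict

# Synthetic stand-ins for X, y and X_test (hypothetical, much smaller than the real data)
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_test = rng.randn(450, 5)

clf = SVC()

# Out-of-fold predictions on the training data, used only to judge model quality
trainPred = cross_val_predict(clf, X, y, cv=2)
cvAccuracy = np.mean(trainPred == y)

# cross_val_predict works on clones, so clf is still unfitted here;
# fit it once on all of the training data, then predict the unlabeled test set.
clf.fit(X, y)
testPred = clf.predict(X_test)
```

In this sketch trainPred has one entry per training row and testPred one per row of X_test. cross_val_score would play the same role as cross_val_predict here: it returns one score per fold rather than a fitted model, so the final fit and predict steps would still be needed.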