
I used to write loops to find the best parameters for my model, which led to errors in my code, so I decided to use GridSearchCV.
I am trying to find the best parameters for PCA in my model (the only parameter I want to grid search over).
In this model, after normalization I want to combine the original features with the PCA-reduced features and then apply a linear SVM.
Then I save the whole model and use it to predict new input.

I get an error on the line where I fit the data, which I need to do before I can use best_estimator_ and best_params_.
The error says: TypeError: The score function should be a callable, all (<type 'str'>) was passed. I did not pass any parameter to GridSearchCV that should take a string, so I am not sure why I get this error.

I also want to know whether the line print("shape after model", X.shape) before saving my model should print (150, 7) or (150, 5), depending on which parameter is chosen.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import StandardScaler
from sklearn.externals import joblib
from numpy import array

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)  # prints (150, 4)
print(y)

# create the models and pipeline them
combined_features = FeatureUnion([("pca", PCA()), ("univ_select", SelectKBest(k='all'))])
svm = SVC(kernel="linear")

pipeline = Pipeline([("scale", StandardScaler()),("features", combined_features), ("svm", svm)])

# Do grid search over n_components:
param_grid = dict(features__pca__n_components=[1,3])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, verbose=10)
grid_search.fit(X, y)
print("best parameters", grid_search.best_params_)

print("shape after model",X.shape) #should this print (150, 7) or (150, 5) based on best parameter?

#save the model
joblib.dump(grid_search.best_estimator_, 'model.pkl', compress = 1)

#new data to predict
Input = array([2.9, 4.0, 1.2, 0.2])

#use the saved model to predict the new data
modeltrain="model.pkl"
modeltrain_saved = joblib.load(modeltrain) 
model_predictions = modeltrain_saved.predict(Input.reshape(1, -1))
print(model_predictions)

I have updated the code above based on the answer.


1 Answer


You are supplying 'all' as a positional argument to SelectKBest. But according to the documentation, if you want to pass 'all', you need to specify it as a keyword:

SelectKBest(k='all')

The reason is that k is a keyword argument, so it must be given with its name. The first positional argument of SelectKBest is the scoring function, so when you do not name the parameter, 'all' is treated as the score function, and hence the error.
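For illustration, a minimal sketch of the difference (the positional call is what produces the error reported in the question, once the selector is fitted):

from sklearn.feature_selection import SelectKBest

# Positional: 'all' becomes score_func, later raising
# TypeError: The score function should be a callable, all (<type 'str'>) was passed.
broken = SelectKBest('all')

# Keyword: k='all' keeps every feature, as intended
fixed = SelectKBest(k='all')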

Update:

Now about the shape: the original X will not be changed, so it will still print (150, 4). The data is transformed on the fly inside the pipeline. On my machine the best_params_ is n_components=1, so the final shape that goes into the SVM is (150, 5): 1 feature from PCA plus 4 from SelectKBest.
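If you want to check the transformed shape yourself, one way (a sketch, reusing the fitted grid search and the step names from the question) is to run only the preprocessing steps of the best estimator:

best = grid_search.best_estimator_

# Apply only the scaling and feature-union steps, not the final SVM
X_scaled = best.named_steps['scale'].transform(X)
X_combined = best.named_steps['features'].transform(X_scaled)

print(X_combined.shape)  # e.g. (150, 5) when n_components=1: 1 PCA + 4 original features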

Vivek Kumar
  • 35,217
  • 8
  • 109
  • 132
  • @Vivek_Kumar Also, is "features__pca__n_components" a keyword that is recognized by GridSearchCV as a parameter for PCA? If not, how does it avoid mistaking it for another parameter, such as the gamma of the SVM, which I did not define in param_grid? Thanks – april May 01 '18 at 19:34
  • @april Yes, `"features__pca__n_components"` is a keyword for the pipeline object. GridSearchCV just sends the keywords to the pipeline, and the pipeline handles the assignment based on the step names used when the pipeline was built (see the first sketch after these comments). – Vivek Kumar May 02 '18 at 05:09
  • @Vivek_Kumar: I have seen a question you answered in another thread (https://stackoverflow.com/questions/49160206/does-gridsearchcv-perform-cross-validation) but I couldn't reply; it is related to what I posted here. If I split my data into 75% training and 25% test, perform grid search over the training data, and after obtaining the best parameters I want to compute the accuracy of my model on the test data, is it important that I use the same number of cross-validation folds for the evaluation on the test data as I used for the grid search over the training data? – april Jun 12 '18 at 14:16
  • @april No, you just call `predict()` or `score()` on the test data for evaluation (see the second sketch after these comments). You don't cross-validate on the test data; if you did, what would you do in each fold? – Vivek Kumar Jun 12 '18 at 14:18
  • @Vivek_Kumar: Got it, thanks. Still, I have a question about the necessity of a separate unseen test set to evaluate whether the final model is biased. grid_result.cv_results_ also reports mean_test_score and mean_train_score (for all parameter combinations, not only the best), so, considering that in each iteration the i-th fold is unseen, we should still be able to tell from the difference between the mean test and mean train scores whether the model is biased. – april Jun 12 '18 at 18:49
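First sketch, on the naming convention discussed in the comments (assuming the pipeline defined in the question): parameter names follow the step__parameter pattern, and the full list of names GridSearchCV will accept can be printed from the pipeline itself:

# Every valid key for param_grid, e.g. 'features__pca__n_components',
# 'features__univ_select__k', 'svm__C', 'svm__gamma', ...
print(sorted(pipeline.get_params().keys()))

Second sketch, on evaluating the model on held-out test data (again assuming the grid search and data from the question):

from sklearn.model_selection import train_test_split

# Hold out 25% of the data before any grid search
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validate on the training portion only
grid_search.fit(X_train, y_train)

# Evaluate the refit best estimator once on the untouched test set; no cross-validation needed here
print(grid_search.score(X_test, y_test))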