
So I found out that StandardScaler() can make my RFECV, nested inside my GridSearchCV with 3-fold cross-validation at each level, run faster. Without StandardScaler(), my code ran for more than 2 days, so I canceled it and decided to inject StandardScaler into the process. But now it has been running for more than 4 hours and I am not sure if I have done it right. Here is my code:

from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Choose linear SVM as the classifier
LSVM = SVC(kernel='linear')

selector = RFECV(LSVM, step=1, cv=3, scoring='f1')

param_grid = [{'estimator__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

clf = make_pipeline(StandardScaler(),
                    GridSearchCV(selector,
                                 param_grid,
                                 cv=3,
                                 refit=True,
                                 scoring='f1'))

clf.fit(X, Y)

I think I haven't gotten it right, to be honest, because I believe the StandardScaler() should be placed inside the GridSearchCV() so that it normalizes the data on each fold, not just once. Please correct me if I am wrong, or if my pipeline is incorrect and that is why it is still running for so long.
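For what it's worth, the concern can be demonstrated on synthetic data (this is a sketch, not my actual dataset): with the scaler inside the pipeline it is re-fit on each training fold, whereas scaling once up front leaks the test fold's statistics into training.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, Y = make_classification(n_samples=300, n_features=20, random_state=0)

# Scaler inside the pipeline: re-fit on each training fold (no leakage).
scores_inside = cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel='linear')), X, Y, cv=3)

# Scaler fit once on all the data before CV: test-fold statistics leak in.
X_all = StandardScaler().fit_transform(X)
scores_outside = cross_val_score(SVC(kernel='linear'), X_all, Y, cv=3)
```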

I have 8,000 rows with 145 features to be pruned by RFECV, and 6 C values to be searched by GridSearchCV. So for each C value, the best feature set is determined by the RFECV.

Thanks!

Update:

So I put the StandardScaler inside the RFECV like this:

from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = SVC(kernel='linear')

kf = KFold(n_splits=3, shuffle=True, random_state=0)

estimators = [('standardize', StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

pipeline = Mypipeline(estimators)
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1', verbose=10)

param_grid = [{'estimator__svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]

clf = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1', verbose=10)

But it still throws the following error:

ValueError: Invalid parameter C for estimator Pipeline(memory=None, steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False))]). Check the list of available parameters with estimator.get_params().keys().

chmscrbbrfck
  • Yes, you are correct. That `make_pipeline` containing `StandardScaler` and `SVC` should be inside the RFECV, and `GridSearchCV` should be the outer layer. But even then, you cannot say for sure that the code will finish in a reasonable time. It's just that linear SVMs can fail to converge on the given data and may run for a long time. Combine that with the RFE and the grid search, which multiply the running time. – Vivek Kumar Jan 07 '19 at 14:55
  • Okay but now it throws out an error (see the edit). – chmscrbbrfck Jan 07 '19 at 15:06
  • Since you have now changed the structure, you need to also change the parameter names. Correct name would be `estimator__svc__C`. But then you will face errors on `RFECV`. Because it needs the `coef_` of `SVC` which is not exposed by the pipeline. See [this question](https://stackoverflow.com/q/51415968/3374996) – Vivek Kumar Jan 07 '19 at 15:19
  • Also [see this](https://stackoverflow.com/a/51418655/3374996) for more explanation – Vivek Kumar Jan 07 '19 at 15:21
  • That is complex, wow. So it turns out that StandardScaler() wouldn't speed up the process, so is it valid to not use it at all (i.e., not normalize the data)? I used the same code (without StandardScaler()) for logistic regression; it only took 15 minutes and gave good accuracy. Now I just want to train an SVM for comparison, so is it safe to assume that even though the problem is linear, which LR resolved easily, it can still be hard for a linear SVM? – chmscrbbrfck Jan 07 '19 at 15:25
  • I am not saying it would not. In most cases it would definitely help support vector machines converge faster. You should try it. And maybe other SVMs (like the RBF kernel) will converge faster. – Vivek Kumar Jan 07 '19 at 15:27
  • Thank you, I should. By the way, how do I program the param_grid parameters? Sorry to sound ignorant, but I followed your advice and changed the name to estimator__svc__C, but it still didn't work. I have updated the code above. Thanks – chmscrbbrfck Jan 07 '19 at 15:38
  • You changed the make_pipeline to Pipeline. Here, instead of `svc`, you should put the name that you gave to the SVM, i.e. `clf`. So the final name is `estimator__clf__C`. – Vivek Kumar Jan 07 '19 at 15:40
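Putting the comments together, a minimal sketch of the corrected setup might look like this (run here on small synthetic data rather than the asker's 8,000×145 dataset; the key points are the pipeline subclass exposing `coef_` for RFECV and the `estimator__clf__C` parameter path matching the step name `clf`):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, Y = make_classification(n_samples=200, n_features=10, random_state=0)

class Mypipeline(Pipeline):
    # Expose the final estimator's coef_ so RFECV can rank features.
    @property
    def coef_(self):
        return self._final_estimator.coef_

pipeline = Mypipeline([('standardize', StandardScaler()),
                       ('clf', SVC(kernel='linear'))])

kf = KFold(n_splits=3, shuffle=True, random_state=0)
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='f1')

# Note 'clf', not 'svc': the parameter path must use the step name.
param_grid = [{'estimator__clf__C': [0.1, 1, 10]}]

search = GridSearchCV(rfecv, param_grid, cv=3, scoring='f1')
search.fit(X, Y)
```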

1 Answer


Kumar is right. Also, you might want to turn on verbose in the GridSearchCV. You could also add a limit to the number of iterations of the SVC, starting from a very small number, like 5, just to make sure that the problem is not with convergence.
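As a sketch of that suggestion (the values here are illustrative, not tuned): capping `max_iter` makes a non-converging fit return quickly with a `ConvergenceWarning` instead of running indefinitely, which tells you whether convergence is the bottleneck.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVC

X, Y = make_classification(n_samples=500, n_features=20, random_state=0)

# Cap the solver at 5 iterations: if this fit warns, the long runtime
# is likely a convergence problem, not the grid search itself.
svm = SVC(kernel='linear', max_iter=5)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    svm.fit(X, Y)

hit_cap = any(issubclass(w.category, ConvergenceWarning) for w in caught)
```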

Sokolokki
  • Is it possible that this will not converge at all? – chmscrbbrfck Jan 07 '19 at 15:09
  • There is a tolerance parameter, and the stopping condition is that the "difference" between the training and the true values falls below the tolerance value. So yes, if the model is bad (for whatever reason), it might fail to converge. But using verbose you can see the details of training. – Sokolokki Jan 07 '19 at 15:13
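To illustrate the tolerance comment (a sketch on synthetic data, not the asker's dataset): loosening `tol` relaxes the stopping criterion, so the solver can stop earlier at the cost of a cruder solution.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, Y = make_classification(n_samples=500, n_features=20, random_state=0)

# Loose tolerance: the solver stops as soon as the optimality gap
# drops below 0.1, so it may finish in far fewer iterations.
loose = SVC(kernel='linear', tol=1e-1).fit(X, Y)

# Tight tolerance: keeps iterating until the gap is below 1e-5.
tight = SVC(kernel='linear', tol=1e-5).fit(X, Y)
```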