
I am carrying out a grid search for an SVR model using a time-series split. The problem is that the grid search takes roughly 30+ minutes, which is too long. My data set consists of 17,800 data points. Is there any way I could reduce this duration? My code is:

from sklearn.svm import SVR
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn import preprocessing as pre

X_feature = X_feature.reshape(-1, 1)
y_label = y_label.reshape(-1,1)

param = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
          'C': [1, 10, 100, 1000]},
         {'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [1, 2, 3, 4]}]


reg = SVR(C=1)
timeseries_split = TimeSeriesSplit(n_splits=3)
clf = GridSearchCV(reg, param, cv=timeseries_split, scoring='neg_mean_squared_error')


# scale features and target to the [0, 1] range
X = pre.MinMaxScaler(feature_range=(0, 1)).fit(X_feature)
scaled_X = X.transform(X_feature)

y = pre.MinMaxScaler(feature_range=(0, 1)).fit(y_label)
scaled_y = y.transform(y_label)

clf.fit(scaled_X, scaled_y.ravel())  # ravel() passes y as a 1-D array

My data for scaled y is:

 [0.11321139]
 [0.07218848]
 ...
 [0.64844211]
 [0.4926122 ]
 [0.4030334 ]]

And my data for scaled X is:

[[0.2681013 ]
 [0.03454225]
 [0.02062136]
 ...
 [0.92857565]
 [0.64930691]
 [0.20325924]]
Asif.Khan

2 Answers


Use GridSearchCV(..., n_jobs=-1) in order to use all available CPU cores in parallel.
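A minimal sketch, reusing the reg, param, timeseries_split and scaled data from the question:

clf = GridSearchCV(reg, param, cv=timeseries_split,
                   scoring='neg_mean_squared_error',
                   n_jobs=-1)  # n_jobs=-1 runs the candidate fits on all CPU cores
clf.fit(scaled_X, scaled_y.ravel())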

Alternatively, you can use RandomizedSearchCV, which evaluates only a fixed number of sampled parameter combinations instead of trying every combination in the grid.
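A hedged sketch, reusing the rbf grid from the question as the sampling space (n_iter=10 is just an illustrative budget, not a recommended value):

from sklearn.model_selection import RandomizedSearchCV

rnd_search = RandomizedSearchCV(reg,
                                param_distributions={'kernel': ['rbf'],
                                                     'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                                                     'C': [1, 10, 100, 1000]},
                                n_iter=10,  # evaluate only 10 sampled combinations
                                cv=timeseries_split,
                                scoring='neg_mean_squared_error',
                                n_jobs=-1)
rnd_search.fit(scaled_X, scaled_y.ravel())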

MaxU - stand with Ukraine
  • Just wanted to ask, if I am using Jupyter Azure would n_jobs=-1 still apply? Just curious, also because the code has been running for 25 minutes and nothing has happened – Asif.Khan Jul 04 '18 at 20:25
  • @Asif.Khan `n_jobs = -1` implies all processors are to be used. – Shayan Shafiq Feb 28 '21 at 13:55

Depending on the data size and the estimator, it might take a long time. Alternatively, you can try breaking the process into smaller parts by using only one type of kernel at a time, like this:

param_rbf = {'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
             'C': [1, 10, 100, 1000]}

Then use it like this

clf = GridSearchCV(reg, param_rbf, cv=timeseries_split, scoring='neg_mean_squared_error')

Similarly, run the search separately for each kernel with its own parameter dictionary:

params_poly = {'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [1, 2, 3, 4]}
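Then a second, separate search can be run with it (a sketch reusing reg, timeseries_split and the scaled data from the question):

clf_poly = GridSearchCV(reg, params_poly, cv=timeseries_split,
                        scoring='neg_mean_squared_error')
clf_poly.fit(scaled_X, scaled_y.ravel())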

I know this is not exactly a solution, just a few suggestions to help you reduce the time where possible.

Also, set the verbose parameter to a non-zero value. This will show the progress of the search.
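For instance (verbose=2 is just an illustrative level; higher values print more detail):

clf = GridSearchCV(reg, param_rbf, cv=timeseries_split,
                   scoring='neg_mean_squared_error', verbose=2)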

Also, setting n_jobs=-1 will not necessarily lead to a speed-up. See this answer for reference.

Gambit1614
  • This is faster as the code is split but I think this will take a while regardless because of the amount of data. Thank you very much! @Mohammed Kashif – Asif.Khan Jul 05 '18 at 11:01