How many combinations will GridSearchCV run for this?

Question

I'm using sklearn to run a grid search on a random forest classifier. It has been running longer than I expected, and I am trying to estimate how much time is left. I thought the total number of fits would be 3*3*3*3*5 = 405.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(n_jobs=-1, oob_score=True, verbose=1)
param_grid = {'n_estimators': [50, 200, 500],
              'max_depth': [2, 3, 5],
              'min_samples_leaf': [1, 2, 5],
              'max_features': ['auto', 'log2', 'sqrt']
              }

gscv = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
gscv.fit(X.values, y.values.reshape(-1,))

From the output, I see it cycling through the tasks where each set is the number of estimators:

[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 5.3min
[Parallel(n_jobs=-1)]: Done 200 out of 200 tasks | elapsed: 6.2min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 3.0s
[Parallel(n_jobs=8)]: Done 200 tasks out of 200 tasks | elapsed: 3.2s finished
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 50 tasks out of 50 tasks | elapsed: 1.5min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 50 out of 50 tasks | elapsed: 0.8s finished

I counted up the number of "finished" and it is at 680 currently. I thought it would be done at 405. Is my calculation wrong?

Brad Solomon · Accepted Answer · 2018-03-14T17:42:31.170

11

Your calculation seems correct: the number of parameter combinations is the Cartesian product of the values listed for each parameter, which in this case is 81:

>>> from sklearn.model_selection import ParameterGrid

>>> pg = ParameterGrid(param_grid)
>>> len(pg)
81

Each combination is fit once per cross-validation fold, five folds here, for a total of 405 fits. The "tasks" count in the log is a separate indicator entirely.

verbose gets passed through to a parent class BaseForest, and subsequently to joblib's Parallel.

I'm not precisely sure what constitutes a task in this case, but the number of top-level grid-train combinations should be 405. Keep in mind each of these is in turn an ensemble of trees.
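The arithmetic can be sketched with the standard library alone. The count of 405 cross-validation fits follows directly from the grid in the question. The extra refit assumes GridSearchCV's default refit=True. The "finished" estimate is only an assumption, not something confirmed in this thread: it supposes each cross-validation fit logs one "finished" line while training and one while predicting for the score, which happens to match the 811 reported in the comments.

```python
import math

# Grid copied from the question.
param_grid = {
    'n_estimators': [50, 200, 500],
    'max_depth': [2, 3, 5],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['auto', 'log2', 'sqrt'],
}
cv = 5

# Number of parameter combinations: product of the list lengths.
n_candidates = math.prod(len(values) for values in param_grid.values())

# One fit per combination per fold.
n_cv_fits = n_candidates * cv

# Assumption: refit=True (the default) adds one final fit on the full data.
n_total_fits = n_cv_fits + 1

# Assumption (hypothetical reading of the log): two "finished" lines per
# cross-validation fit (train, then predict), plus one for the refit.
expected_finished = 2 * n_cv_fits + 1

print(n_candidates, n_cv_fits, n_total_fits, expected_finished)
```

If that reading holds, the fraction of the search completed could be estimated by dividing the number of "finished" lines seen so far by expected_finished.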

edited Mar 14 '18 at 17:42

answered Mar 14 '18 at 17:33

For tasks I'm seeing "out of 200 tasks", "out of 50 tasks", or "out of 500 tasks". It seems to line up to the number of trees. I just noticed that the n_jobs jumps back and forth between -1 and 8 ... not sure why that is, but perhaps I should expect 810 "finished"s? – user4446237 Mar 14 '18 at 17:52
Well, n_jobs=-1 will just map to the number of cores on your computer, which I'd guess is 8? As for the tasks ... again, I'm not precisely sure what constitutes a task, but it does seem to be separate from the grid size – Brad Solomon Mar 14 '18 at 18:05
Completed running. Ended up with 811 "finished". Regardless, your calculation is correct. – user4446237 Mar 15 '18 at 02:07
@user4446237 So coincidentally, I just ran into this same situation. Parameter grid of size 9, CV=4. You'd expect to have 9 * 4 = 36 total calls to .fit(), but there were 82 "Done 1000 out of 1000" statuses. (On a random forest with 1000 trees.) So, I'm not precisely sure why there's a difference. – Brad Solomon Mar 20 '18 at 20:59
