I am using sklearn to run a grid search on a random forest classifier. It has been running longer than I expected, and I am trying to estimate how much time is left. I expected the total number of fits to be 3*3*3*3*5 = 405.

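from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier(n_jobs=-1, oob_score=True, verbose=1)
param_grid = {'n_estimators': [50, 200, 500],
              'max_depth': [2, 3, 5],
              'min_samples_leaf': [1, 2, 5],
              'max_features': ['auto', 'log2', 'sqrt']
              }

# X and y are assumed to be pandas objects defined earlier (they expose .values)
gscv = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
gscv.fit(X.values, y.values.reshape(-1,))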

In the output, I see it cycling through sets of tasks, where the size of each set matches the number of estimators:

[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.2min
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 5.3min
[Parallel(n_jobs=-1)]: Done 200 out of 200 tasks | elapsed: 6.2min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 3.0s
[Parallel(n_jobs=8)]: Done 200 tasks out of 200 tasks | elapsed: 3.2s finished
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 1.1min
[Parallel(n_jobs=-1)]: Done 50 tasks out of 50 tasks | elapsed: 1.5min finished
[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.5s
[Parallel(n_jobs=8)]: Done 50 out of 50 tasks | elapsed: 0.8s finished

I counted the "finished" lines and the count is currently at 680. I thought it would be done at 405. Is my calculation wrong?

Brad Solomon
user4446237

1 Answer

Your calculation seems correct: the number of parameter combinations is the Cartesian product of the values listed for each parameter, which in this case is 3*3*3*3 = 81:

>>> from sklearn.model_selection import ParameterGrid

>>> pg = ParameterGrid(param_grid)
>>> len(pg)
81

For each combination, you have five cross-validation fits, for a total of 405. The "tasks" count in the log is a separate indicator entirely.
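The arithmetic can be double-checked without sklearn. This is a minimal sketch using only the standard library and the grid from the question. The variable names (`n_candidates`, `cv_fits`, `total_fits`) are made up for illustration, and the extra "+ 1" assumes GridSearchCV's default `refit=True`, which retrains the best candidate once on the full data after the search.

```python
from itertools import product

# Grid copied from the question; the values are only counted, never fitted.
param_grid = {
    'n_estimators': [50, 200, 500],
    'max_depth': [2, 3, 5],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['auto', 'log2', 'sqrt'],
}
cv = 5  # number of cross-validation folds, as in the question

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per fold.
cv_fits = n_candidates * cv

# Assumption: refit=True (the GridSearchCV default) adds one final fit.
total_fits = cv_fits + 1

print(n_candidates, cv_fits, total_fits)
```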

verbose gets passed through to a parent class BaseForest, and subsequently to joblib's Parallel.

I'm not precisely sure what constitutes a task in this case, but the number of top-level grid-train combinations should be 405. Keep in mind each of these is in turn an ensemble of trees.
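If the goal is to track progress over the 405 fits rather than over individual trees, one option is to leave `verbose` off on the forest and set it on `GridSearchCV` instead. The sketch below is only an illustration under assumptions: it uses a small synthetic dataset from `make_classification` and a deliberately tiny grid (`small_grid`, `X_demo`, `y_demo`, and `n_folds` are hypothetical names, not from the question) so that it finishes quickly. With `verbose=1` on the search, sklearn should report the number of folds, candidates, and total fits at the start of the run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Synthetic stand-in data; the question's X and y are not available here.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A tiny illustrative grid so the search completes in seconds.
small_grid = {'n_estimators': [5, 10], 'max_depth': [2, 3]}
n_folds = 3

# verbose is left at its default (0) on the forest, so no per-tree task logs.
clf = RandomForestClassifier(random_state=0)

# verbose=1 on the search itself reports progress at the level of fits.
gscv = GridSearchCV(estimator=clf, param_grid=small_grid, cv=n_folds, verbose=1)
gscv.fit(X_demo, y_demo)

# Number of cross-validation fits the search performs (excluding the final refit).
expected_fits = len(ParameterGrid(small_grid)) * n_folds
print(expected_fits)
```

Applied to the grid in the question, the same calculation gives 81 * 5 = 405 cross-validation fits.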

Brad Solomon
  • For tasks I'm seeing "out of 200 tasks", "out of 50 tasks", or "out of 500 tasks". It seems to line up to the number of trees. I just noticed that the n_jobs jumps back and forth between -1 and 8 ... not sure why that is, but perhaps I should expect 810 "finished"s? – user4446237 Mar 14 '18 at 17:52
  • Well, n_jobs=-1 will just map to the number of cores on your computer, which I'd guess is 8? As for the tasks ... again, I'm not precisely sure what constitutes a task, but it does seem to be separate from just the grid size – Brad Solomon Mar 14 '18 at 18:05
  • Completed running. Ended up with 811 "finished". Regardless, your calculation is correct. – user4446237 Mar 15 '18 at 02:07
  • @user4446237 So coincidentally, I just ran into this same situation. Parameter grid of size 9, CV=4. You'd expect to have 9 * 4 = 36 total calls to .fit(), but there were 82 "Done 1000 out of 1000" statuses. (On a random forest with 1000 trees.) So, I'm not precisely sure why there's a difference. – Brad Solomon Mar 20 '18 at 20:59