I want to understand how the max_samples value of a BaggingClassifier affects the number of samples used to train each of its base estimators.
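My reading of the docs is that, with a float max_samples, each base estimator is trained on that fraction of the rows the bagger is fit on, i.e. roughly (X here just stands for whatever data the bagger is fit on):

# My understanding: with a float max_samples, each base estimator
# draws this many rows from the data X that the bagger is fit on.
n_draws_per_estimator = int(max_samples * X.shape[0])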
This is the fitted GridSearchCV object:
GridSearchCV(cv=5, error_score='raise',
estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1, spl... n_estimators=100, n_jobs=-1, oob_score=False,
random_state=1, verbose=2, warm_start=False),
fit_params={}, iid=True, n_jobs=-1,
param_grid={'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)
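For context, the search was set up roughly like this (a reconstruction from the repr above; X_train / y_train are placeholder names for my training data, and the GridSearchCV import path depends on the scikit-learn version):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV  # older sklearn; newer versions use sklearn.model_selection

bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                        n_estimators=100, n_jobs=-1, random_state=1, verbose=2)

param_grid = {'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]}

gs5 = GridSearchCV(bag, param_grid=param_grid, cv=5, n_jobs=-1,
                   error_score='raise', verbose=2)
gs5.fit(X_train, y_train)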
Here I am finding out what the best params were:
print gs5.best_score_, gs5.best_params_
0.828282828283 {'max_features': 0.6, 'max_samples': 1.0}
Now I take the best estimator from the grid search and try to see how many samples that particular BaggingClassifier used for each of its 100 base decision tree estimators.
import numpy as np

val = []
for i in np.arange(100):
    # estimators_samples_[i] is the sample mask for the i-th base estimator;
    # bincount(...)[1] counts the entries equal to 1 (the rows marked as drawn)
    x = np.bincount(gs5.best_estimator_.estimators_samples_[i])[1]
    val.append(x)

print np.max(val)
print np.mean(val), np.std(val)
587
563.92 10.3399032877
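For reference, here is the same counting written out a bit more explicitly (a sketch; it assumes estimators_samples_ returns one boolean mask per base estimator, as it does in my scikit-learn version, and it reuses the fitted gs5 from above):

import numpy as np

best = gs5.best_estimator_
masks = best.estimators_samples_

# Each mask has one entry per row of the data the estimator was fit on;
# summing it counts the rows marked as drawn for that base estimator.
print len(masks[0])
counts = [np.sum(m) for m in masks]
print np.max(counts), np.mean(counts), np.std(counts)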
Now, the training set has 891 rows. Since CV is 5, each grid-search fit should see 891 * 0.8 = 712.8 rows, and since max_samples is 1.0, 891 * 0.8 * 1.0 = 712.8 should be the number of samples for each base estimator, or something close to it?
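Just to spell out the arithmetic behind that expectation:

n_train = 891          # rows in the training set
cv_fraction = 4.0 / 5  # with cv=5, each fit sees 4/5 of the data
max_samples = 1.0      # best value found by the grid search

expected = n_train * cv_fraction * max_samples
print expected  # ~712.8 by my calculation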
So why is the count in the range 564 +/- 10, with a maximum of 587, when by that calculation it should be close to 712? Thanks.