
I want to understand how the max_samples value of a Bagging classifier affects the number of samples used by each of the base estimators.

This is the GridSearch output:

GridSearchCV(cv=5, error_score='raise',
       estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, spl... n_estimators=100, n_jobs=-1, oob_score=False,
         random_state=1, verbose=2, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_features': [0.6, 0.8, 1.0], 'max_samples': [0.6, 0.8, 1.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=2)

Here I am finding out what the best params were:

print(gs5.best_score_, gs5.best_params_)
0.828282828283 {'max_features': 0.6, 'max_samples': 1.0}

Now I am picking out the best grid search estimator and trying to see the number of samples that specific Bagging classifier used in its set of 100 base decision tree estimators.

import numpy as np

val = []
for i in np.arange(100):
    # estimators_samples_[i] is a boolean mask over the training set;
    # bincount(...)[1] counts the True entries, i.e. the number of
    # unique samples used by base estimator i
    x = np.bincount(gs5.best_estimator_.estimators_samples_[i])[1]
    val.append(x)
print(np.max(val))
print(np.mean(val), np.std(val))

587
563.92 10.3399032877

Now, the size of the training set is 891. Since CV is 5, 891 * 0.8 = 712.8 samples should go into each Bagging classifier evaluation, and since max_samples is 1.0, 891 * 0.8 * 1.0 = 712.8 should be the number of samples per base estimator, or something close to it?

So why is the number in the range 564 +/- 10, with a maximum of 587, when by this calculation it should be close to 712? Thanks.

hkhare

1 Answer


After doing more research, I think I've figured out what's going on. GridSearchCV uses cross-validation on the training data to determine the best parameters, but the estimator it returns is fit on the entire training set, not one of the CV-folds. This makes sense because more training data is usually better.

So, the BaggingClassifier you get back from GridSearchCV is fit to the full dataset of 891 samples. It's true, then, that with max_samples=1.0 each base estimator will randomly draw 891 samples from the training set. However, by default samples are drawn with replacement, so the number of unique samples will be less than the total number of samples due to duplicates. If you want to draw without replacement, set the bootstrap keyword of BaggingClassifier to False.
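A quick pure-Python sketch (stdlib only, no scikit-learn required) illustrates the difference. The dataset size 891 is taken from the question; the sampling itself stands in for what the bagging machinery does internally:

```python
import random

random.seed(0)
n = 891

# With replacement (bootstrap=True, the default): duplicates shrink
# the number of *unique* samples a base estimator actually sees.
with_repl = [random.randrange(n) for _ in range(n)]
print(len(set(with_repl)))   # noticeably less than 891

# Without replacement (bootstrap=False): every drawn index is distinct.
without_repl = random.sample(range(n), n)
print(len(set(without_repl)))  # exactly 891
```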

Now, exactly how close should we expect the number of distinct samples to be to the size of the dataset when drawing with replacement?

Based on this question, the expected number of distinct samples when drawing n samples with replacement from a set of n samples is n * (1 - ((n - 1)/n)^n). Plugging in n = 891:

>>> 891 * (1.- (890./891)**891)
563.4034437025824

The expected number of unique samples (563.4) is very close to your observed mean (563.92), so it appears that nothing abnormal is going on.
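The closed-form expectation can also be checked by simulation. This is a stdlib-only sketch (the trial count of 200 is an arbitrary choice), averaging the unique-sample count over repeated bootstrap draws:

```python
import random

random.seed(42)
n, trials = 891, 200

# For each trial, draw n indices with replacement and count the distinct ones.
uniques = [len({random.randrange(n) for _ in range(n)}) for _ in range(trials)]
mean_unique = sum(uniques) / trials

# Closed-form expectation: n * (1 - ((n - 1)/n)^n)
expected = n * (1 - ((n - 1) / n) ** n)
print(round(expected, 1))   # 563.4
print(round(mean_unique, 1))  # should land close to 563.4
```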

bpachev
  • The reason I'm a bit confused is that I expected the max_features and max_samples keywords to work similarly. When I use estimators_features_ to see which features were used to train the 100 base decision tree estimators, I see that all 100 trees used a subset of 9 features each; since my dataset has 16 features and 0.6 * 16 = 9.6, it makes sense for 9 to be the maximum. But no tree has fewer than 9 features; all have exactly 9. So for samples I likewise expected either all estimators to use a fixed subset of 712 samples, or at least numbers closer to 712. – hkhare Aug 05 '16 at 06:05
  • The problem is that samples are drawn with replacement by default. – bpachev Aug 05 '16 at 17:05
  • OK. I did some more research, and it turns out that GridSearchCV is returning an estimator trained on the full dataset of 891 points. Also, when doing sampling with replacement, you get quite a few duplicates. See my re-written answer for the details. – bpachev Aug 05 '16 at 19:15
  • Thanks for the explanation! It answers my question perfectly. Also, I tried bootstrap=False, and the GS gave me 0.6 as the best estimator's max_samples value. On re-running the bincount snippet, I get a value of 534 +/- 0 for the number of samples picked over the 100 base estimators, which totally agrees with the expectation of a constant number of samples being picked in case of 'without replacement'. (891 * 0.6 = 534.6) – hkhare Aug 06 '16 at 06:45