What is stratified bootstrap?

Question

I have learned about the bootstrap and about stratification. But what is a stratified bootstrap, and how does it work?

Let's say we have a dataset of n instances (observations) and m classes. How should I divide the dataset, and what percentage should go to training and to testing?

score 9 · Accepted Answer · edited Oct 01 '19 at 05:25

You split your dataset per class. Afterwards, you sample from each sub-population independently. The number of instances you draw from each sub-population should be proportional to that sub-population's share of the whole dataset.

 data <- the full dataset
 d(i) <- { x in data | class(x) = i }
 for each class i
    n(i) <- samplesize * size(d(i)) / size(data)
    for j = 1..n(i)
       sample(i) <- sample(i) U { element drawn at random from d(i) }
 sample <- U sample(i)

If you sample four elements from a dataset with classes {'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'}, this procedure draws 4 × 6/8 = 3 elements from class a and 4 × 2/8 = 1 element from class b, so class b is guaranteed to appear in the stratified sample.
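As a rough illustration, the pseudocode above could be sketched in Python as follows. The function name `stratified_sample`, its parameters, the use of rounding for the per-class counts, and drawing with replacement are all assumptions made for this sketch; they are not spelled out in the answer.

```python
import random
from collections import defaultdict

def stratified_sample(data, labels, sample_size, seed=None):
    """Draw roughly `sample_size` items, allocating draws to each class
    in proportion to its share of `data` (with replacement)."""
    rng = random.Random(seed)

    # Split the dataset per class: strata[c] holds all items with label c.
    strata = defaultdict(list)
    for item, label in zip(data, labels):
        strata[label].append(item)

    sample = []
    for label, members in strata.items():
        # Draws for this class, proportional to its size. Rounding is an
        # assumption: the counts may not sum exactly to sample_size when
        # the proportions do not yield whole numbers.
        n_draws = round(sample_size * len(members) / len(data))
        sample.extend(rng.choices(members, k=n_draws))
    return sample

data = ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b']
sample = stratified_sample(data, labels=data, sample_size=4, seed=0)
print(sorted(sample))  # ['a', 'a', 'a', 'b']
```

Here `sample_size` is the total number of instances drawn across all classes, which is how the per-class formula in the pseudocode uses `samplesize`.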

edited Oct 01 '19 at 05:25

Trisoloriansunscreen


answered Feb 10 '16 at 23:52

CAFEBABE


Is `samplesize` equal to the number of classes in the dataset, or to the number of instances in the dataset? – Kevin217 Feb 10 '16 at 23:59

score 1 · Answer 2 · edited Jun 05 '21 at 08:29

I just had to implement this in Python, so I will post my current approach here in case it is of interest to others.

Function to create an index into the original DataFrame for a stratified bootstrap sample

I chose to iterate over all relevant strata in the original DataFrame, retrieve the index of the rows belonging to each stratum, and randomly draw (with replacement) as many samples from the stratum as the stratum itself contains.

The randomly drawn indices can then be combined into one list, which should end up with the same length as the original DataFrame.

import pandas as pd
from random import choices

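def provide_stratified_bootstap_sample_indices(bs_sample):
    # number of rows per stratum, indexed by the stratum value
    strata = bs_sample.loc[:, "STRATIFICATION_VARIABLE"].value_counts()
    bs_index_list_stratified = []

    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for idx_stratum_var, n_stratum_var in strata.items():
        # row labels of all rows that belong to this stratum
        data_index_stratum = list(bs_sample[bs_sample["STRATIFICATION_VARIABLE"] == idx_stratum_var].index)
        # draw, with replacement, as many labels as the stratum has rows
        bs_index_list_stratified.extend(choices(data_index_stratum, k=len(data_index_stratum)))

    return bs_index_list_stratified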

And then the actual bootstrapping loop (say, 10k times):

k = 10000
RESULTS = []

for i in range(k):
    # draw row labels stratum by stratum, with replacement
    bs_index_list_stratified = provide_stratified_bootstap_sample_indices(DATA_original)
    bs_sample = DATA_original.loc[bs_index_list_stratified, :]

    # process data with some statistical operation as required and save the result of each iteration
    RESULTS.append(FUNCTION_X(bs_sample))
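For comparison, the same per-stratum resampling can probably be written more compactly with pandas' `groupby(...).sample(...)`, available since pandas 1.1. This is only a sketch on made-up toy data: the helper name `stratified_bootstrap`, the `value` column, and the choice of the mean as the statistic are assumptions for illustration, not part of the approach above.

```python
import pandas as pd

def stratified_bootstrap(df, strata_col, random_state=None):
    # Within each stratum, draw as many rows as the stratum contains,
    # with replacement, so every resample keeps the stratum sizes.
    return df.groupby(strata_col).sample(
        frac=1.0, replace=True, random_state=random_state
    )

# Hypothetical toy data: six rows in stratum "a", two in stratum "b".
toy = pd.DataFrame({
    "STRATIFICATION_VARIABLE": ["a"] * 6 + ["b"] * 2,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0],
})

means = []
for i in range(200):
    bs_sample = stratified_bootstrap(toy, "STRATIFICATION_VARIABLE", random_state=i)
    # Example statistic: the mean of the resampled values.
    means.append(bs_sample["value"].mean())

# Stratum sizes of the last resample match those of the original data.
print(bs_sample["STRATIFICATION_VARIABLE"].value_counts().to_dict())
```

Passing a different `random_state` per iteration keeps the run reproducible while still giving a different resample each time.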
