
I have a DataFrame with 4 rows and want to bootstrap it X times.

DATA FRAME EXAMPLE:

              Index A B
                0   1 2
                1   1 2
                2   1 2
                3   1 2 

I figured out this code for the bootstrap resampling:

      from sklearn.utils import resample

      boot = resample(df, replace=True, n_samples=len(df), random_state=1)
      print('Bootstrap Sample: %s' % boot)

Now I would like to repeat this X times. How can I do this?

Desired output for X = 20:

  Sample Nr.    Index A B
      1         0   1 2
                1   1 2
                2   1 2
                3   1 2 
     ...
      20        0   1 2
                1   1 2
                1   1 2
                2   1 2   

Thank you guys.

Best

  • you mean like I want to get n different bootstrap samples from my data? – Miguel Trejo Oct 28 '20 at 14:27
  • yes thats right @MiguelTrejo. The code above is only able to create 1 bootstrap sample. But i would like to get X many (like maybe >1000). Thank you very much – TheRealSanity Oct 28 '20 at 14:29
  • do you mean the `sample` function or the `resample` function?, the params you specify are for the sample function? – Miguel Trejo Oct 28 '20 at 14:32
  • For the resample function. So to explain more clearly: 1) We have the original data 2) create a X times of this original data in resampled data. 2) the code: boot = resample(df, replace=True, n_samples=len(df), random_state=1) print('Bootstrap Sample: %s' % boot) Creates only 1 resampled data from the original data. --> so create more resampled data from the original data is the goal (resampling with repetition). @MiguelTrejo – TheRealSanity Oct 28 '20 at 14:34

2 Answers


Approach 1: Sample Data in Parallel

Calling the sample method of a DataFrame n times can be time consuming, so one option is to run the calls in parallel.

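import multiprocessing
from itertools import repeat

import numpy as np

def sample_data(df, replace, random_state):
    '''Generate one sample of size len(df)'''
    return df.sample(replace=replace, n=len(df), random_state=random_state)

def resample_data(df, replace, n_samples, random_state):
    '''Call the sample method n_samples times in parallel'''

    # Derive one distinct seed per sample from random_state (None draws fresh entropy);
    # reusing a single seed would make every bootstrap sample identical
    seeds = [int(s) for s in np.random.SeedSequence(random_state).generate_state(n_samples)]

    # Run sample_data in a pool of worker processes
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    bootstrap_samples = pool.starmap(sample_data, zip(repeat(df), repeat(replace), seeds))
    pool.close()
    pool.join()

    return bootstrap_samples

Now, if I want to generate 15 samples, resample_data returns a list with 15 samples, each of size len(df).

samples = resample_data(df, True, n_samples=15, random_state=1)

Note that each sample is drawn with its own seed derived from random_state: passing the same seed to every call would return 15 identical samples. An integer random_state makes the whole run reproducible, while random_state=None gives different results on every run. On platforms that start worker processes with spawn (Windows, for example), the call to resample_data has to sit under an if __name__ == '__main__': guard.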

Approach 2: Sample Data Linearly

Another approach is a list comprehension that calls sample_data once per sample. Since the function is already defined, this is straightforward.

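def resample_data_linearly(df, replace, n_samples, random_state):
    '''Call the sample method n_samples times, one after the other'''
    # Give each sample its own seed; reusing one seed would return n_samples identical samples
    seeds = [None if random_state is None else random_state + i for i in range(n_samples)]
    return [sample_data(df, replace, seed) for seed in seeds]

# Generate 10 samples of size len(df)
samples = resample_data_linearly(df, True, n_samples=10, random_state=1)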
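
To keep track of which rows belong to which bootstrap sample when the list is combined into one DataFrame, pd.concat can label each sample through its keys argument. The following is only a minimal sketch under my own assumptions: the level names sample_nr and row, the seeds 0 to 19, and the per-sample column mean are illustrative choices, not something specified in the question.

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({"A": [1, 1, 1, 1], "B": [2, 2, 2, 2]})

n_samples = 20

# One distinct seed per sample (0..19, an arbitrary choice) keeps the run reproducible
samples = [df.sample(n=len(df), replace=True, random_state=seed)
           for seed in range(n_samples)]

# keys= adds an outer index level holding the sample number (1..20);
# the level names "sample_nr" and "row" are hypothetical
boot = pd.concat(samples, keys=list(range(1, n_samples + 1)),
                 names=["sample_nr", "row"])

# A statistic per bootstrap sample, here the mean of every column
per_sample_mean = boot.groupby(level="sample_nr").mean()
```

boot is a single DataFrame with a two-level index, so boot.loc[3] returns the rows of the third sample, and any groupby on the sample_nr level computes a statistic per sample.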
Miguel Trejo
  • Thank you very much. But the output seem to create a lot of errors: (An attempt has been made to start a new process before the current process has finished its bootstrapping phase. This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:) for instance. @MiguelTrejo – TheRealSanity Oct 28 '20 at 15:14
  • and also the length of the new sample has to be the same as the original one. (so the n_samples = 15 doesnt do 15 new samples but create a sample with only 15 data points from the original sample to create one new sample. – TheRealSanity Oct 28 '20 at 15:17
  • It seems that problem is for Windows, perhaps you can use a docker container to run your code, this works fine on Linux – Miguel Trejo Oct 28 '20 at 15:20
  • You're right if the sample size should be the data size it can be changed, I'll make the edit. – Miguel Trejo Oct 28 '20 at 15:22
  • Okay i will try. Thank you very much @MiguelTrejo. But do you have maybe a more easier approach? I have edited the question above with dataframe examples. I just want to repeat the code (20 times) for example. – TheRealSanity Oct 28 '20 at 15:23
  • @LinhTran you could use a list comprehension, see the last edit for a code example. I hope this is works for you. – Miguel Trejo Oct 28 '20 at 15:37
  • Thank you very much @MiguelTrejo ! But it is possible to get the output (each resample) as a dataframe and not a list maybe? – TheRealSanity Oct 28 '20 at 15:50
  • yes, just apply a concat on the result `pd.concat(resample_data_linearly(df, True, n_samples=10, random_state=1))` – Miguel Trejo Oct 28 '20 at 15:56
  • Great! Thanks to get help in this. I have a last question: when you have time i would be highly thankful if not it is also okay. Now i want to Calculate (Row 'A' + Row 'B' / len(df)) for each resample package. How is this possible? Since pd.concat creates one big new dataframe with no indication of the data sample Nr. i am struggling with this right now. Thanks. – TheRealSanity Oct 28 '20 at 16:08
  • maybe you can help me a liitle with this? Thank you – TheRealSanity Oct 29 '20 at 14:13
  • @LinhTran can you more specific with your second question? for each sample you want to calculate (Row 'A' + Row 'B' / len(df))?, which will be Row A and Row B? – Miguel Trejo Oct 30 '20 at 18:31

Just want to add another approach that uses numpy.random.Generator.choice. This approach will work whether your data is a numpy array or pandas dataframe.

Using the sample data you provided:

import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3],
                  'A': [1, 1, 1, 1],
                  'B': [2, 2, 2, 2]})
df

Here is how I would do it with the numpy approach:

rng = np.random.default_rng()

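def simple_bootstrap(data, replace=True, replicates=5, random_state=None, shuffle=True):
    # Reuse the module-level rng unless a seed is given, so random_state makes runs reproducible
    gen = rng if random_state is None else np.random.default_rng(random_state)
    def simple_resample(data, size=len(data), replace=replace, shuffle=shuffle, axis=0):
        # Draw `size` whole rows (axis=0); replace and shuffle are forwarded to choice
        return gen.choice(a=data, size=size, replace=replace, shuffle=shuffle, axis=axis)
    return [simple_resample(data) for _ in range(replicates)]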

When I call the function on my df like below, it returns 5 replicates (the default), each with 4 rows drawn at random from my data:

simple_bootstrap(df)

[array([[1, 1, 2],
        [2, 1, 2],
        [0, 1, 2],
        [3, 1, 2]], dtype=int64),
 array([[0, 1, 2],
        [1, 1, 2],
        [1, 1, 2],
        [3, 1, 2]], dtype=int64),
 array([[3, 1, 2],
        [1, 1, 2],
        [1, 1, 2],
        [2, 1, 2]], dtype=int64),
 array([[3, 1, 2],
        [1, 1, 2],
        [3, 1, 2],
        [3, 1, 2]], dtype=int64),
 array([[0, 1, 2],
        [3, 1, 2],
        [3, 1, 2],
        [3, 1, 2]], dtype=int64)]

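Note that each of the 5 replicates is a NumPy array with 4 rows, because size=len(data) rows are drawn along axis=0. If a has more than one dimension, the size shape will be inserted into the axis dimension, so the output ndim will be a.ndim - 1 + len(size). Whole rows are therefore sampled, and the values within each row stay together.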
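
The shape rule above can be checked directly on a small array. This is a minimal sketch with the same values as the example data; the seed 0 and the tuple size (5, 4) are my own choices for illustration. It also suggests that all replicates can be drawn in a single call by passing a tuple as size.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch reproducible

# Same values as the example DataFrame: columns index, A, B
data = np.array([[0, 1, 2],
                 [1, 1, 2],
                 [2, 1, 2],
                 [3, 1, 2]])

# One replicate: 4 whole rows drawn with replacement along axis 0
sample = rng.choice(data, size=4, replace=True, axis=0)
print(sample.shape)   # (4, 3)

# Five replicates at once: the size shape (5, 4) replaces the row axis,
# so ndim = a.ndim - 1 + len(size) = 2 - 1 + 2 = 3
stacked = rng.choice(data, size=(5, 4), replace=True, axis=0)
print(stacked.shape)  # (5, 4, 3)
```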

You could also extend your bootstrap function to include a statistical function that runs over each replication and saves it into a list, like the example below:

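def simple_bootstrap(data, statfunction, replace=True, replicates=5, random_state=None, shuffle=True):
    # Reuse the module-level rng unless a seed is given, so random_state makes runs reproducible
    gen = rng if random_state is None else np.random.default_rng(random_state)
    def simple_resample(data, size=len(data), replace=replace, shuffle=shuffle, axis=0):
        return gen.choice(a=data, size=size, replace=replace, shuffle=shuffle, axis=axis)
    # Apply statfunction to every replicate and collect the estimates
    resample_estimates = [statfunction(simple_resample(data)) for _ in range(replicates)]
    return resample_estimates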
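
As a usage sketch for a bootstrap function that takes a statistic, the example below computes the column means of every replicate. The names bootstrap_estimates and column_means are hypothetical helpers written for this illustration, and the seed is arbitrary. It is a simplified variant of the idea above, not the exact function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1], "B": [2, 2, 2, 2]})

def bootstrap_estimates(data, statfunction, replicates=5, random_state=None):
    """Hypothetical helper: apply statfunction to each bootstrap replicate."""
    rng = np.random.default_rng(random_state)
    values = np.asarray(data)  # works for both DataFrames and arrays
    return [
        statfunction(rng.choice(values, size=len(values), replace=True, axis=0))
        for _ in range(replicates)
    ]

def column_means(sample):
    """Mean of every column of one replicate."""
    return sample.mean(axis=0)

estimates = bootstrap_estimates(df, column_means, replicates=5, random_state=1)
```

Every entry of estimates is an array with one mean per column. With the constant toy data each replicate gives the same means, so real data is needed to see any spread between replicates.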
GSA