I'm trying to create a bootstrapped sample from a MultiIndex dataframe in pandas. Below is some code to generate the kind of data I need.

import pandas as pd
import numpy as np

df = pd.DataFrame({'group1': [1, 1, 1, 2, 2, 3],
                   'group2': [13, 18, 20, 77, 109, 123],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
# Use both group columns as a two-level MultiIndex
df = df.set_index(['group1', 'group2'])

print(df)

The df dataframe looks like:

                   value1  value2
group1 group2                
1      13         1.1     7.1
       18         2.0     8.0
       20         3.0     9.0
2      77         4.0    10.0
       109        5.0    11.0
3      123        6.0    12.0

I want to draw a random sample, with replacement, from the first index level (group1). For example, suppose np.random.randint(1, 4, size=3) produces [3, 2, 2]. I'd like the resulting dataframe to look like:

                   value1  value2
group1 group2                
3      123        6.0    12.0
2      77         4.0    10.0
       109        5.0    11.0
2      77         4.0    10.0
       109        5.0    11.0

I've spent a lot of time researching this and have been unable to find a similar example where the MultiIndex values are integers, the second level has a variable number of entries per group, and the sampled first-level values can repeat. This is how I think an appropriate implementation of bootstrapping would work.

juanpa.arrivillaga
Chris

1 Answer

Try:

df.unstack().sample(3, replace=True).stack()
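
The one-liner above pivots every group2 value into columns before sampling, which can get very wide when there are many distinct second-level values. A minimal sketch of an alternative that skips the pivot is shown below: draw first-level labels with replacement, then concatenate the matching blocks of rows. The helper name `bootstrap_groups` and its `seed` parameter are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd

# Same example data as in the question
df = pd.DataFrame({'group1': [1, 1, 1, 2, 2, 3],
                   'group2': [13, 18, 20, 77, 109, 123],
                   'value1': [1.1, 2, 3, 4, 5, 6],
                   'value2': [7.1, 8, 9, 10, 11, 12]})
df = df.set_index(['group1', 'group2'])


def bootstrap_groups(frame, n, seed=None):
    """Hypothetical helper: resample whole first-level groups with replacement."""
    rng = np.random.default_rng(seed)
    # Distinct labels of the first index level (group1)
    labels = frame.index.get_level_values(0).unique()
    # Draw n labels; the same label may be drawn more than once
    drawn = rng.choice(labels, size=n, replace=True)
    # frame.loc[[label]] keeps both index levels and all rows of that group
    return pd.concat([frame.loc[[label]] for label in drawn])


sample = bootstrap_groups(df, n=3, seed=0)
print(sample)
```

The result can contain repeated index entries, which is expected for a bootstrap sample. Looping over the drawn labels is simple but may be slow when the number of draws is very large, so treat this as a starting point rather than a tuned solution.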

piRSquared
  • Doesn't work for large datasets: ValueError: Unstacked DataFrame is too big, causing int32 overflow – Amin Mar 04 '21 at 04:25
  • @Amin 4.5 year old answer. Ask a new one and mention large dataset. Request for memory and cpu efficiency – piRSquared Mar 04 '21 at 04:54