8

I have a data frame containing information about a population that I wish to generate a sample from. I also have a dataframe sample_info that details how many units of each group in the population dataframe I need in my sample. I have developed some code that achieves what I need, but it runs slower than I would like given the large datasets I am working with.

Is there a way to group the population frame and apply sampling to the groups rather than looping through them as I have done below?

import pandas as pd

population = pd.DataFrame([[1,True],[1,False],[1,False],[2,True],[2,True],[2,False],[2, True]], columns = ['Group ID','Response'])

    Group ID    Response
0   1           True
1   1           False
2   1           False
3   2           True
4   2           True
5   2           False
6   2           True

sample_info = pd.DataFrame([[1,5],[2,6]], columns = ['Group ID','Sample Size'])

    Group ID    Sample Size
0   1           5
1   2           6

output = pd.DataFrame(columns = ['Group ID','Response'])


for index, row in sample_info.iterrows():
    output = output.append(population.loc[population['Group ID'] == row['Group ID']].sample(n=row['Sample Size'], replace=True))

I couldn't figure out how to bring in the sample size information using group-by and apply, as suggested in Pandas: sample each group after groupby.


2 Answers

7

Convert sample_info to a dictionary, group population by Group ID, and pass each group's sample size to DataFrame.sample via the dictionary.

mapper = sample_info.set_index('Group ID')['Sample Size'].to_dict()

population.groupby('Group ID').apply(lambda x: x.sample(n=mapper.get(x.name))).reset_index(drop = True)
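For reference, here is this approach run end to end on the question's data; replace=True is added (as noted in the comments) so that a group can be sampled beyond its own size, since the requested sizes of 5 and 6 exceed the group sizes of 3 and 4:

```python
import pandas as pd

population = pd.DataFrame(
    [[1, True], [1, False], [1, False], [2, True], [2, True], [2, False], [2, True]],
    columns=['Group ID', 'Response'])
sample_info = pd.DataFrame([[1, 5], [2, 6]], columns=['Group ID', 'Sample Size'])

# Map each Group ID to its requested sample size
mapper = sample_info.set_index('Group ID')['Sample Size'].to_dict()

# replace=True permits a sample larger than the group itself
output = (population.groupby('Group ID')
          .apply(lambda x: x.sample(n=mapper.get(x.name), replace=True))
          .reset_index(drop=True))

print(output['Group ID'].value_counts())
```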
  • Thank you, this works perfectly once I added in replace = True. – Ryan Apr 04 '19 at 20:07
  • You are welcome. I assumed your data would be large, so you would not get the "sample size is bigger than the group" error. – Vaishali Apr 04 '19 at 20:08
  • Frustrating that this is the answer (though a good answer). I went down the route of joining and then trying something vectorized on the sampling, but it appears it does have to be done with `apply`. It seemed like a common enough thing to do that there would be a faster way :) – roganjosh Apr 04 '19 at 20:09
  • @roganjosh, this is the most idiomatic way of sampling by groups. The only difference here was fetching the sample size from another data frame. :) – Vaishali Apr 04 '19 at 20:10
  • I feel like maybe the for loop is faster than apply @roganjosh – BENY Apr 04 '19 at 20:16
  • @Wen-Ben iterating each group rather than the whole DF? `apply` will iterate in a Python `for` loop anyway, but maybe with some overhead, right? – roganjosh Apr 04 '19 at 20:20
  • We can ask OP to find the difference for us :) @Ryan, can you please tell us about the performance of this code vs the one using the for loop? – Vaishali Apr 04 '19 at 20:21
  • @roganjosh let me think more; maybe we just need to get the index we need, and then all we need to do is reindex – BENY Apr 04 '19 at 20:21
  • @Wen-Ben On second glance, I really like that train of thought. So, maybe the `join` to get the sample size into a single DF and then going off the index, using sample size, could do it – roganjosh Apr 04 '19 at 20:33
  • @Vaishali I will try and produce some performance metrics tomorrow, as I've finished for the day now. While I can't share the data directly, I will give a brief summary so you know what I'm working with, as in reality it's more complicated than the above. My actual population dataframe consists of around 30k observations divided into 200 groups, which are being resampled with replacement (30k sample size, each group 20%-500% sampled). This sampling is repeated around 10k times, with an aggregation being performed on each sample and the results stored in a list. – Ryan Apr 04 '19 at 21:38
  • In case someone wants to group by multiple columns, the code changes as follows: `mapper = sample_info.set_index(['col1','col2'])['Sample Size'].to_dict() population.groupby(['col1','col2']).apply(lambda x: x.sample(n=mapper.get(x.name))).reset_index(drop = True)` – Anand Sonawane Feb 24 '22 at 02:36
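The loop-vs-apply question raised in the comments can be checked with a small timing harness. The sketch below uses synthetic data roughly matching the shape Ryan describes (about 30k rows in 200 groups; the group and size values are made up), and swaps append-in-a-loop for collecting pieces and calling pd.concat once, since repeated append is quadratic:

```python
import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
population = pd.DataFrame({
    'Group ID': rng.integers(0, 200, 30_000),
    'Response': rng.random(30_000) > 0.5,
})
sample_info = pd.DataFrame({
    'Group ID': np.arange(200),
    'Sample Size': rng.integers(50, 300, 200),
})
mapper = sample_info.set_index('Group ID')['Sample Size'].to_dict()

def loop_version():
    # Filter each group explicitly, then concatenate once at the end
    parts = [population[population['Group ID'] == g].sample(n=n, replace=True)
             for g, n in mapper.items()]
    return pd.concat(parts, ignore_index=True)

def apply_version():
    # Let groupby do the splitting, sampling each group in apply
    return (population.groupby('Group ID')
            .apply(lambda x: x.sample(n=mapper.get(x.name), replace=True))
            .reset_index(drop=True))

print('loop :', timeit.timeit(loop_version, number=5))
print('apply:', timeit.timeit(apply_version, number=5))
```

Which one wins can depend on the pandas version and the number of groups, so it is worth timing on the real data.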
3

I am not sure about the speed, but sampling the index looks like it saves memory at least:

import numpy as np

d = population.groupby('Group ID').groups
a = np.concatenate([np.random.choice(d[x], y) for x, y in zip(sample_info['Group ID'], sample_info['Sample Size'])])
population.loc[a]
Out[83]: 
   Group ID  Response
1         1     False
1         1     False
2         1     False
0         1      True
1         1     False
3         2      True
5         2     False
3         2      True
4         2      True
5         2     False
5         2     False
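A self-contained sketch of this index-sampling idea, extended to the repeated draw-and-aggregate workload described in the comments on the other answer (the function name one_sample is made up for illustration): the group positions are computed once up front, so each draw only pays for np.random.choice plus one .loc.

```python
import numpy as np
import pandas as pd

population = pd.DataFrame(
    [[1, True], [1, False], [1, False], [2, True], [2, True], [2, False], [2, True]],
    columns=['Group ID', 'Response'])
sample_info = pd.DataFrame([[1, 5], [2, 6]], columns=['Group ID', 'Sample Size'])

# Compute each group's row labels once, up front
groups = population.groupby('Group ID').groups

def one_sample():
    # np.random.choice samples with replacement by default
    idx = np.concatenate([np.random.choice(groups[g], n)
                          for g, n in zip(sample_info['Group ID'],
                                          sample_info['Sample Size'])])
    return population.loc[idx]

# Repeated resampling, with an aggregation on each draw
means = [one_sample()['Response'].mean() for _ in range(100)]
```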