I have a DataFrame, population, containing information about a population that I wish to draw a sample from. I also have a DataFrame, sample_info, that details how many units of each group in the population DataFrame I need in my sample. I have developed some code that achieves what I need, but it runs slower than I would like given the large datasets I am working with.
Is there a way to group the population frame and apply sampling to the groups, rather than looping through them as I have done below?
import pandas as pd
population = pd.DataFrame([[1,True],[1,False],[1,False],[2,True],[2,True],[2,False],[2, True]], columns = ['Group ID','Response'])
   Group ID  Response
0         1      True
1         1     False
2         1     False
3         2      True
4         2      True
5         2     False
6         2      True
sample_info = pd.DataFrame([[1,5],[2,6]], columns = ['Group ID','Sample Size'])
   Group ID  Sample Size
0         1            5
1         2            6
output = pd.DataFrame(columns = ['Group ID','Response'])
for index, row in sample_info.iterrows():
    group = population.loc[population['Group ID'] == row['Group ID']]
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    output = pd.concat([output, group.sample(n=row['Sample Size'], replace=True)])
I couldn't figure out how to bring in the sample size information using groupby and apply, as suggested in "Pandas: sample each group after groupby".
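
For illustration, here is a minimal sketch of one way the sample sizes could be brought into a groupby-based approach. It is not a benchmarked solution. The names `sizes` and `pieces` are hypothetical helpers introduced here, and the sketch assumes every 'Group ID' appears at most once in sample_info. It still iterates, but over the groups produced by a single groupby pass rather than re-filtering the whole population frame for every row of sample_info.

```python
import pandas as pd

population = pd.DataFrame(
    [[1, True], [1, False], [1, False], [2, True], [2, True], [2, False], [2, True]],
    columns=['Group ID', 'Response'],
)
sample_info = pd.DataFrame([[1, 5], [2, 6]], columns=['Group ID', 'Sample Size'])

# Hypothetical helper: a Series mapping each group ID to its requested sample size.
# Assumes each 'Group ID' occurs only once in sample_info.
sizes = sample_info.set_index('Group ID')['Sample Size']

# Split the population once with groupby, then sample each group with replacement.
# Groups that have no entry in sample_info are skipped.
pieces = [
    group.sample(n=int(sizes.loc[group_id]), replace=True)
    for group_id, group in population.groupby('Group ID')
    if group_id in sizes.index
]

# Combine the per-group samples with a single concat instead of growing a frame in a loop.
output = pd.concat(pieces)
```

Collecting the pieces in a list and concatenating once avoids copying the growing output frame on every iteration, which is usually where repeated append-style loops lose time. Whether this is fast enough for a given dataset would need to be measured.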