R subset stratify dataframe by group; subset the max amount of observations per group as long as true and false boolean are balanced: (python answer is also accepted)
I have a dataset of 10000 samples from 600 restaurant IDs, with some of the IDs missing and a biased boolean that I need to balance to 50:50 before I run any models. To recreate the dataset, here is the code
x<-floor(runif(10000, 0, 600)) #make a dataset of 10000 samples from 600 restaurant IDs
x<-sort(x)
y<-sample(0:1,10000,prob=c(.16,.84),replace=TRUE) #make a biased boolean for those 10000 samples
df = data.frame(x,y) #dataframe has random number of restaurants and biased boolean
colnames(df) <- c("Restaurant_ID","Restaurant_Bool")
summary(df)
nrow(df)
z<-floor(runif(10, 0, 600)) #create a 10 restaurants by ID that are missing from the dataset
for (i in 10) {
df<-df[!(df$Restaurant_ID==z[i]),] #remove those restaurants by ID from the dataset
}
summary(df)
nrow(df)
The ratio for true:false for the dataset is about 84:16, but that number also varies per restaurant ID
Similar to stratification by restaurant ID, I need to limit the amount of true observations to be equal to the number of false observations per restaurant ID
I have no idea how to code this and anything helps
so for example, for restaurant_ID 0, there could be 10 observations, where 8 has true and 2 has false. There is no restaurant_ID 1.
for restaurant_ID 2, there could be 8 observations, where 3 has true and 5 has false.
X restaurant_ID Restaurant_Bool
1 0 1
2 0 1
3 0 1
4 0 0
5 0 1
6 0 1
7 0 1
8 0 0
9 0 1
10 0 1
11 2 0
12 2 0
13 2 1
14 2 0
15 2 1
16 2 0
17 2 1
18 2 0
...
I want a subset result where the number of Restaurant_Bool == 0 is the same as the number of Restaurant_Bool == 1 as long as the maximum number of observations are subsetted based based on the minimum number of boolean observation per restaurant_ID
X restaurant_ID Restaurant_Bool
1 0 1
2 0 1
4 0 0
8 0 0
11 2 0
12 2 0
13 2 1
15 2 1
16 2 0
17 2 1
...
This could be the first subset, and another subset could use the other observations to randomly recreate another subset with the same rule:
X restaurant_ID Restaurant_Bool
6 0 1
7 0 1
4 0 0
8 0 0
14 2 0
18 2 0
13 2 1
15 2 1
16 2 0
17 2 1
...
...and so on, where multiple different subsets from the same data set could be created by keeping the same sample number of Restaurant_Bool == 1 as Restaurant_Bool == 0 per restaurant_ID
In the rare chance that Restaurant_Bool == 0 has more observations than Restaurant_Bool == 1, then use the least represented boolean to recreate the dataset per restaurant ID, to where a whole restaurant ID could be removed from the dataset if either true or false has no observations
The reason that I want to stratify by restaurant_ID is that there could be some internal correlation with the rest of the columns that I need to preserve when making my model
The closest answer I found is this Subset panel data by group , but does not take into effect that I want to keep the max amount of observations per restaurant_ID as long as true and false boolean are balanced