Yesterday I asked a similar question: R - Randomly split a dataframe in n equal pieces
The answer I got is nearly what I need, but it still has problems. I have also thought about other ways to get the result.
This is my example list of data frames:
set.seed(0L)
AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")
I want to randomly split each data frame in the list into n equal parts (or as close to equal as possible). I already got a very helpful answer from chinsoon12:
new = lapply(df_list, function(df) {
  n <- nrow(df)
  split(df, cut(sample(n), seq(1, n, by=floor(n/4)), labels=FALSE, include.lowest=TRUE))
})
The problem is that it does not work for every number of rows, and not all observations are taken into account. E.g. when I divide my df_list into 5 subsets with that method, I get subsets of 325, 324, 324, 324, 324 rows for AB_df. In total that is 1621, not 1624, so 3 rows are missing. When I divide it into 4 pieces, I only get 3 subsets. Any idea why this is happening?
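A quick look at the break points that seq() produces seems to account for both symptoms. This is only a sketch using n = 1624 (the row count of AB_df); the names breaks5, grp5 and breaks4 are made up for illustration:

```r
n <- 1624  # number of rows in AB_df

# Break points when asking for 5 pieces: the last break is 1621, not n
breaks5 <- seq(1, n, by = floor(n / 5))
print(breaks5)  # 1 325 649 973 1297 1621

# Row numbers above the last break fall outside every interval and become NA,
# so split() drops them
grp5 <- cut(seq_len(n), breaks5, labels = FALSE, include.lowest = TRUE)
print(sum(is.na(grp5)))  # 3

# Break points when asking for 4 pieces: only 4 breaks, hence only 3 intervals
breaks4 <- seq(1, n, by = floor(n / 4))
print(breaks4)  # 1 407 813 1219
```

So it looks as if the breaks would need to end exactly at n, with n + 1 break points for n pieces, for every row to be assigned to a group.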
I also thought about two other ways of splitting the data frames in the list. One way might be to simply shuffle the rows of each data frame:
for (a in seq_along(df_list)) {
  # put the rows of each data frame into random order
  df_list[[a]] = df_list[[a]][sample(nrow(df_list[[a]])), ]
}
Now I would only need to divide the data frames into n pieces, but this is the point where I am not sure how to do that.
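One possible way to finish this second approach, as a sketch rather than a tested solution: after shuffling, assign consecutive rows to n groups whose sizes differ by at most one row, then call split(). The helper name split_shuffled and the choice n = 5 are hypothetical, and a stand-in data frame with the same shape as AB_df is used so the block runs on its own:

```r
set.seed(0L)
# Stand-in for one element of df_list (same shape as AB_df)
df <- data.frame(replicate(2, sample(0:130, 1624, replace = TRUE)))

# Hypothetical helper: shuffle the rows, then cut into n nearly equal chunks
split_shuffled <- function(df, n) {
  shuffled <- df[sample(nrow(df)), , drop = FALSE]
  # sort(rep_len(...)) yields group ids 1..n in contiguous blocks,
  # with block sizes differing by at most one row
  groups <- sort(rep_len(seq_len(n), nrow(shuffled)))
  split(shuffled, groups)
}

parts <- split_shuffled(df, 5)
print(sapply(parts, nrow))  # sizes: 325 325 325 325 324
```

Applied to the whole list, this would be lapply(df_list, split_shuffled, n = 5).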
A third way I thought of would be to create a random vector of group numbers 1:n (for n subsamples), add it to each data frame as a column, and then extract the rows according to that number.
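This third idea could look roughly like the following. It is a sketch under assumptions: the column name grp and n = 5 are made up for illustration, and a stand-in data frame replaces df_list so the block is self-contained:

```r
set.seed(0L)
# Stand-in for one element of df_list (same shape as AB_df)
df <- data.frame(replicate(2, sample(0:130, 1624, replace = TRUE)))
n <- 5

# Balanced group labels 1..n, shuffled so that the assignment is random
df$grp <- sample(rep_len(seq_len(n), nrow(df)))

# Extract the pieces by label; every row lands in exactly one piece
pieces <- split(df, df$grp)
print(table(df$grp))
```

Because the labels come from rep_len() before shuffling, the group sizes differ by at most one row, which is not guaranteed when the labels are drawn directly with sample(1:n, nrow(df), replace = TRUE).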
I still think the first way is the easiest, and I would prefer it. Any idea what's wrong with the code?