Randomly take equal number of elements from two groups -- create two sub-dataframes from one dataframe with equal number of elements

Question

I have a dataset as such:

data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

     ID  Group
1    A1 Group1
2    A6 Group1
3    A3 Group1
4   A55 Group1
5    BC Group1
6    J5 Group2
7    Ca Group2
8   KQF Group2
9    FK Group2
10 AAAA Group2
11 ABBd Group1
12 XXF Group2

How can I create two sub-dataframes from the above data such that there are no repeats and there are exactly the same number of elements from Group1 and Group2 in each sub-dataframe? Both sub-dataframes combined together are always identical to the original dataframe.

ID is always unique.

EXAMPLE RESULT

subDF1
     ID  Group
1    A1 Group1
4   A55 Group1
11 ABBd Group1
6    J5 Group2
8   KQF Group2
9    FK Group2

subDF2
     ID  Group
2    A6 Group1
3    A3 Group1
5    BC Group1
7    Ca Group2
10 AAAA Group2
12  XXF Group2

Equal number of elements in subDF1 and subDF2
Equal proportion of elements from Group1 and Group2
Elements in subDF1 should not be in subDF2 and vice-versa

score 2 · Answer 1 · answered Jul 31 '19 at 14:15

2

We can use sample_n after applying distinct

df1 %>% 
  distinct %>% 
  group_by(Group) %>% 
  sample_n(2)

answered Jul 31 '19 at 14:15

akrun

874,273
37
540
662

1

@paropunam. There would be edge cases though. For eg. if you have only duplicates for a group while other group have more than one unique element and you specify n = 3 or so, one of the group doesn't have the required number of elements – akrun Jul 31 '19 at 14:19
@paropunam If you need that those two sub-dataframes would not have common elements among them, this solution won't work for you – Poiu Rewq Jul 31 '19 at 14:56
Perhaps my initial explanation in the main post was not clear, but your solution is entirely incorrect. It only returns `ONE` sub-df containing `TWO` elements from each group contained in the original dataframe. But I need `TWO` sub-dataframes where each contains elements from the original dataframe with the same ratio of elements from `Group1` and `Group2`. Additionally, elements already in `subDF1` should not be in `subDF2`. – paropunam Aug 01 '19 at 12:56

score 0 · Answer 2 · answered Jul 31 '19 at 15:22

So I made my version of the solution based on assumption that you need two sub-dataframes that not only have the same amount of elements from each group but are also made of completely different rows of the main dataframe:

# This function returns the list with two required sub-dataframes
split_df <- function(df, n){
# First of all let's check if you want to cut an appropriately sized slice from groups  
if (any(table(df$Group) <= n*2)){
  return("Your N is too big for a given number of elements in some group(s)")
}
# Then we sample n elements from each group for the first time
sub1 <- unlist(tapply(1:nrow(df), df$Group, function(x){
  sample(x, n)
}))
# Make a new dataframe that has no rows that we subsetted on the prev step
df_2 <- df[-sub1,]
# Subset second time
sub2 <- unlist(tapply(1:nrow(df_2), df_2$Group, function(x){
  sample(x, n)
}))
# And return the list with resulting sub-dfs
return(
  list(
    df[sub1,],
    df_2[sub2,]
  )
)

}

Unfortunately, the resulting sub-dataframes together, when combined, do not contain all of the elements of original dataframe. — paropunam, Aug 01 '19 at 13:05

paropunam · Accepted Answer · 2019-08-01T14:02:54.700

OK. I believe this is the correct way to do it. This will work well even if there are an odd number of elements in one group (or even both).

x <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), 
            Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

x$SubDF <- NA
x[which(x$Group == "Group1"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group1"]/2), 
                                               size = length(which(x$Group == "Group1")), replace = ifelse(test = table(x$Group)["Group1"] %% 2 != 0, yes = TRUE, FALSE))
x[which(x$Group == "Group2"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group2"]/2), 
                                               size = length(which(x$Group == "Group2")), replace = ifelse(test = table(x$Group)["Group2"] %% 2 != 0, yes = TRUE, FALSE))

subDF1 <- x %>% dplyr::filter(SubDF == "SubDF1") %>% dplyr::select(-SubDF)
subDF2 <- x %>% dplyr::filter(SubDF == "SubDF2") %>% dplyr::select(-SubDF)

> subDF1
    ID  Group
1   A3 Group1
2   BC Group1
3   J5 Group2
4   FK Group2
5 AAAA Group2
6 ABBd Group1

> subDF2
   ID  Group
1  A1 Group1
2  A6 Group1
3 A55 Group1
4  Ca Group2
5 KQF Group2
6 XXF Group2

score 0 · Answer 4 · answered Aug 01 '19 at 14:00

I am actually not quite sure that this will suffice but here it is,

library(dplyr)

df %>% 
 mutate(new = rep(seq(n() / 2), 2)) %>% 
 arrange_at(vars(3:2)) %>% 
 mutate(new1 = rep(seq(2), each = max(new))) %>% 
 split(.$new1)

which gives,

$`1`
   ID  Group new new1
1  A1 Group1   1    1
2  Ca Group2   1    1
3  A6 Group1   2    1
4 KQF Group2   2    1
5  A3 Group1   3    1
6  FK Group2   3    1

$`2`
     ID  Group new new1
7   A55 Group1   4    2
8  AAAA Group2   4    2
9    BC Group1   5    2
10 ABBd Group1   5    2
11   J5 Group2   6    2
12  XXF Group2   6    2

Randomly take equal number of elements from two groups -- create two sub-dataframes from one dataframe with equal number of elements

4 Answers4