3

I have a dataset as such:

data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

     ID  Group
1    A1 Group1
2    A6 Group1
3    A3 Group1
4   A55 Group1
5    BC Group1
6    J5 Group2
7    Ca Group2
8   KQF Group2
9    FK Group2
10 AAAA Group2
11 ABBd Group1
12 XXF Group2

How can I create two sub-dataframes from the above data such that there are no repeats and there are exactly the same number of elements from Group1 and Group2 in each sub-dataframe? Both sub-dataframes combined together are always identical to the original dataframe.

ID is always unique.

EXAMPLE RESULT

subDF1
     ID  Group
1    A1 Group1
4   A55 Group1
11 ABBd Group1
6    J5 Group2
8   KQF Group2
9    FK Group2

subDF2
     ID  Group
2    A6 Group1
3    A3 Group1
5    BC Group1
7    Ca Group2
10 AAAA Group2
12  XXF Group2
  • Equal number of elements in subDF1 and subDF2
  • Equal proportion of elements from Group1 and Group2
  • Elements in subDF1 should not be in subDF2 and vice-versa
paropunam
  • 488
  • 2
  • 11

4 Answers4

2

We can use sample_n after applying distinct

df1 %>% 
  distinct %>% 
  group_by(Group) %>% 
  sample_n(2)
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    @paropunam. There would be edge cases though. For eg. if you have only duplicates for a group while other group have more than one unique element and you specify n = 3 or so, one of the group doesn't have the required number of elements – akrun Jul 31 '19 at 14:19
  • @paropunam If you need that those two sub-dataframes would not have common elements among them, this solution won't work for you – Poiu Rewq Jul 31 '19 at 14:56
  • Perhaps my initial explanation in the main post was not clear, but your solution is entirely incorrect. It only returns `ONE` sub-df containing `TWO` elements from each group contained in the original dataframe. But I need `TWO` sub-dataframes where each contains elements from the original dataframe with the same ratio of elements from `Group1` and `Group2`. Additionally, elements already in `subDF1` should not be in `subDF2`. – paropunam Aug 01 '19 at 12:56
0

So I made my version of the solution based on assumption that you need two sub-dataframes that not only have the same amount of elements from each group but are also made of completely different rows of the main dataframe:

# This function returns the list with two required sub-dataframes
split_df <- function(df, n){
# First of all let's check if you want to cut an appropriately sized slice from groups  
if (any(table(df$Group) <= n*2)){
  return("Your N is too big for a given number of elements in some group(s)")
}
# Then we sample n elements from each group for the first time
sub1 <- unlist(tapply(1:nrow(df), df$Group, function(x){
  sample(x, n)
}))
# Make a new dataframe that has no rows that we subsetted on the prev step
df_2 <- df[-sub1,]
# Subset second time
sub2 <- unlist(tapply(1:nrow(df_2), df_2$Group, function(x){
  sample(x, n)
}))
# And return the list with resulting sub-dfs
return(
  list(
    df[sub1,],
    df_2[sub2,]
  )
)

}
Poiu Rewq
  • 182
  • 13
0

OK. I believe this is the correct way to do it. This will work well even if there are an odd number of elements in one group (or even both).

x <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), 
            Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

x$SubDF <- NA
x[which(x$Group == "Group1"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group1"]/2), 
                                               size = length(which(x$Group == "Group1")), replace = ifelse(test = table(x$Group)["Group1"] %% 2 != 0, yes = TRUE, FALSE))
x[which(x$Group == "Group2"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group2"]/2), 
                                               size = length(which(x$Group == "Group2")), replace = ifelse(test = table(x$Group)["Group2"] %% 2 != 0, yes = TRUE, FALSE))

subDF1 <- x %>% dplyr::filter(SubDF == "SubDF1") %>% dplyr::select(-SubDF)
subDF2 <- x %>% dplyr::filter(SubDF == "SubDF2") %>% dplyr::select(-SubDF)
> subDF1
    ID  Group
1   A3 Group1
2   BC Group1
3   J5 Group2
4   FK Group2
5 AAAA Group2
6 ABBd Group1

> subDF2
   ID  Group
1  A1 Group1
2  A6 Group1
3 A55 Group1
4  Ca Group2
5 KQF Group2
6 XXF Group2
paropunam
  • 488
  • 2
  • 11
0

I am actually not quite sure that this will suffice but here it is,

library(dplyr)

df %>% 
 mutate(new = rep(seq(n() / 2), 2)) %>% 
 arrange_at(vars(3:2)) %>% 
 mutate(new1 = rep(seq(2), each = max(new))) %>% 
 split(.$new1)

which gives,

$`1`
   ID  Group new new1
1  A1 Group1   1    1
2  Ca Group2   1    1
3  A6 Group1   2    1
4 KQF Group2   2    1
5  A3 Group1   3    1
6  FK Group2   3    1

$`2`
     ID  Group new new1
7   A55 Group1   4    2
8  AAAA Group2   4    2
9    BC Group1   5    2
10 ABBd Group1   5    2
11   J5 Group2   6    2
12  XXF Group2   6    2
Sotos
  • 51,121
  • 6
  • 32
  • 66