I am trying to write a function that resamples names nested in groups. My function works for resampling without respect to groups, but I don't want to create samples of names that aren't in the same group.

Here's the function, where x is a vector of all names (some repeated), a is a vector of unique name observations, and b is a vector of unique names in randomized order.

    # Replace every occurrence of a[i] in x with b[i].
    swap_names <- function(x, a, b) {
      x1 <- x  # copy once, before the loop, so earlier replacements are kept
      for (i in seq_along(a)) {
        x1[which(x == a[i])] <- b[i]
      }
      x1
    }

    x <- c("Smith", "Jones", "Washington", "Miller", "Wells", "Smith", "Smith", "Miller")
    a <- sort(unique(x))
    b <- sample(a, length(a))

    dat <- swap_names(x, a, b)
    View(dat)  # same length as x, with each name swapped for its counterpart in b
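
The intended mapping (each name in a swapped for the entry at the same position in b) can also be written as a single lookup with `match()`, which avoids the loop. This is only a sketch; `swap_by_lookup` is a hypothetical name, not something from the original code:

```r
# Hypothetical helper: swap each name for its counterpart in a shuffled copy.
swap_by_lookup <- function(x, a, b) {
  # match(x, a) gives, for every element of x, its position in a;
  # indexing b by those positions applies the whole mapping at once.
  b[match(x, a)]
}

x <- c("Smith", "Jones", "Washington", "Miller", "Wells", "Smith", "Smith", "Miller")
a <- sort(unique(x))
b <- sample(a)  # random permutation of the unique names

dat <- swap_by_lookup(x, a, b)
```

Because b is a permutation of a, every occurrence of a given name in x ends up with the same replacement.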

However, each name is nested in a group, so I need to avoid creating samples of names that are not in the same group. For example:

    x           groupid
    Smith       A1
    Jones       B1
    Washington  C1
    Miller      A2
    Wells       B1
    Smith       A2
    Smith       A3
    Miller      A3

How can I account for that?

1 Answer

This would be easier to accomplish with the tidyverse packages:

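    library(tidyverse)

    # Example data from the question, read from an inline string
    txt <- 'x         groupid
    Smith       A1
    Jones       B1
    Washington  C1
    Miller      A2
    Wells       B1
    Smith       A2
    Smith       A3
    Miller      A3'

    df <- read_table(file = txt)

    # Fixes the random draw so the result is reproducible;
    # remove this line to get a different draw on every run.
    set.seed(0)
    df.new <- df %>%
      group_by(groupid) %>%
      mutate(
        # draw one name per row from the names present in the same group
        b = sample(unique(x), n(), replace = TRUE)
      ) %>%
      arrange(groupid)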

      x          groupid b
      <chr>      <chr>   <chr>
    1 Smith      A1      Smith
    2 Miller     A2      Miller
    3 Smith      A2      Smith
    4 Smith      A3      Smith
    5 Miller     A3      Miller
    6 Jones      B1      Wells
    7 Wells      B1      Jones
    8 Washington C1      Washington
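
If every occurrence of a name within a group should receive the same replacement, one option is to permute each group's unique names and map them back by position. Here is a minimal base-R sketch, assuming that is the behavior wanted. `shuffle_within` is a hypothetical helper name, and the data frame is rebuilt by hand from the example in the question:

```r
# Example data from the question.
df <- data.frame(
  x = c("Smith", "Jones", "Washington", "Miller", "Wells", "Smith", "Smith", "Miller"),
  groupid = c("A1", "B1", "C1", "A2", "B1", "A2", "A3", "A3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: permute the unique names of one group and apply
# the resulting mapping to every row of that group.
shuffle_within <- function(v) {
  u <- unique(v)
  shuffled <- u[sample.int(length(u))]  # also safe when a group has one name
  shuffled[match(v, u)]
}

# ave() applies shuffle_within() separately to each groupid and returns
# the results in the original row order.
df$b <- ave(df$x, df$groupid, FUN = shuffle_within)
```

Called inside a loop without `set.seed()`, this should give a fresh within-group shuffle on each iteration, and names never cross group boundaries.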
jdobres
  • Thanks! That works, but it creates the same samples/order within groups every time. I forgot to include that this is inside a for loop and I want the order to be different each iteration. Is there a way to do that? – quantoid6969 Mar 07 '22 at 01:07
  • Just remove the set.seed line. That controls the random number generation. – jdobres Mar 07 '22 at 01:08
  • got it, thanks! this was super helpful. – quantoid6969 Mar 07 '22 at 01:11
  • This wasn't giving me exactly what I wanted last night, so I tried it again today but I can't get it to run without an error message. Specifically, it doesn't like the mismatch between the length of unique(x) and the size of df which is larger due to repeated x values. Do you know what I'm doing wrong? – quantoid6969 Mar 07 '22 at 22:09
  • In other words, `length(unique(x))` might equal 2, but you have 3 rows to fill in that group? In that case, you need to sample "with replacement". This means that the sample function could select the same name multiple times, but it's the only way to fill a larger bucket from a smaller sample. I've edited my solution to use `replace = T`. – jdobres Mar 07 '22 at 23:00
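
The length mismatch discussed in the comments can be reproduced in isolation. The sketch below uses a hypothetical two-name pool and three draws to show why sampling without replacement fails there, while `replace = TRUE` succeeds:

```r
pool <- c("Smith", "Miller")  # two unique names in a hypothetical group

# Without replacement, asking for more draws than the pool holds is an error.
too_many <- tryCatch(
  sample(pool, 3),
  error = function(e) "error"
)

# With replacement, any number of draws is allowed; names may repeat.
with_repl <- sample(pool, 3, replace = TRUE)
```

Here `too_many` ends up holding the string returned by the error handler, and `with_repl` holds three names drawn from `pool`.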