15

I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).

Consider the following example vector in R:

x <- rep(LETTERS[1:5], 3)  # Create example vector
x
#  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"

If I shuffle my vector using the sample function, some of the identical elements may be too close together. For instance, if I use the following R code, the element "C" appears directly after each other at positions 5 and 6:

set.seed(53135)
sample(x)                  # sample() function puts same elements too close
#  [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"

How could I ensure that identical elements have a minimum distance of three?

Joachim Schork
  • 2,025
  • 3
  • 25
  • 48
  • 5
    You can always do rejection sampling. Depending on your input vector that could be very time-intensive though. – Roland Nov 16 '20 at 10:26
  • 6
    Please provide more information about the vector you attempt to shuffle. Just stating that "some of the vector elements are identical" is not enough as your sampling procedure may run into a dead end. For instance, if you have a vector of 9 elements with 4 of them being identical to each other, then you can never get a sample satisfying your positional constraint. At a minimum, we need to know the frequency count for each type of elements in your vector. – ekoam Nov 16 '20 at 11:24
  • @Roland Thank you very much for the hint on rejection sampling. This might be a bit of an overkill though. I was hoping for a function that has an argument such as min.distance. Anyway, if such a function doesn't exist I'll try to do it with rejection sampling, so thank you very much! – Joachim Schork Nov 16 '20 at 12:49
  • @ekoam Thanks for getting back to me! I want to apply this "shuffling method" to different vectors, so I cannot tell you the exact length. However, I can tell you that my vectors will have a length of around 100-200 elements and all of the elements will have at least one twin. The element with the most occurences will exist about 10 times in the vector. I hope that clarifies my question! – Joachim Schork Nov 16 '20 at 12:54
  • @JoachimSchork is there something wrong with the answers? if so feel free to point it out – Abdessabour Mtk Nov 24 '20 at 22:32
  • @AbdessabourMtk I'm sorry for the late response! (see my comment below your answer) – Joachim Schork Nov 25 '20 at 10:04

3 Answers3

6

So basically we need to conditionally sample one element from the x vector that have not been selected in the min.dist-1 runs. Using purrr's reduce we can achieve this:

min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"

Bundled in a function

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
 [1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
 [1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
 [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
 [1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
 [1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"

after @27ϕ9 comment:

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))

A B C D E 
3 3 3 3 3 
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) : 
  invalid first argument

UPDATE

After some trial and error, looking at the fact that not always you're gonna have enough elements to space out the min.dist I came up with a solution this code is the most explained from the ones above :

shuffle <- function(x, min.dist=2){
    stopifnot(min.dist < length(unique(x)))
    reduce(integer(length(x)-1), function(.x, ...){
        # whether the value is in the tail of the aggregated vector
        in.tail <- x %in% tail(.x, min.dist)
        # whether a value still hasn't reached the max frequency
        freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
        # whether a value isn't in the aggregated vector
        yet <- !x %in% .x
        # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
         c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else  x[which(freq.got)[1]] )
    }, .init=sample(x,1))
}

now running the table(shuffle(rep(LETTERS[1:5], 3),2)) will always return 3 for all vars and we can say with some certainty that in the vector the variables are spaced with a minimum distance of 2. the only way to guarantee that no elements are duplicated is by using min.dist=length(unique(x))-1 otherwise there will be instances where at maximum r < min.dist elements are not min.dist distanced from their last occurrences, and if such elements exist they're going to be in the length(x) + 1 - 1:min.dist subset of the resulting vector.

Just to be completely certain using a loop to check whether tail of the output vector has unique values: (remove the print statement I used it just for demonstration purposes)

shuffler <- function(x, min.dist=2){
    while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
    l
}

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
 [1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"

A B C D E 
3 3 3 3 3 

table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
 [1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"

A B C D E 
3 3 3 3 3 

Update:

shuffler <- function(x, min.dist=2){
    while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
    l
}

this new version does a rigorous test on whether the elements in the tail of the vector are min.distanced, the previous version works for min.dist=2, however this new version does better testing.

Abdessabour Mtk
  • 3,895
  • 2
  • 14
  • 21
  • @27ϕ9 thanks for the remark, it escaped me at first. can u check the edit. – Abdessabour Mtk Nov 20 '20 at 15:20
  • Testing this with `shuffle(rep(letters[1:2], c(2, 3)), 1)` I sometimes get "ababb" which is invalid; the only possible permutation that could be sampled should be "babab". – Mikko Marttila Nov 23 '20 at 16:32
  • @MikkoMarttila Thanks for the remark, all you need to do is replace the call with `shuffler`. – Abdessabour Mtk Nov 23 '20 at 16:57
  • 1
    @AbdessabourMtk I'm very sorry for the late response! I wanted to apply your code to my real data and had to do some preparations before. I just applied your code to my data and it works perfectly fine. Thank you so much for all your effort and testing/development of the code!! – Joachim Schork Nov 25 '20 at 10:04
6

If your data is large, then it may be (way) faster to rely on probability to do that kind of task.

Here's an example:

prob_shuffler = function(x, min.dist = 2){
    n = length(x)
    res = sample(x)
    OK = FALSE
    
    # We loop until we have a solution
    while(!OK){
        OK = TRUE
        for(i in 1:min.dist){
            # We check if identical elements are 'i' steps away
            pblm = res[1:(n-i)] == res[-(1:i)]
            if(any(pblm)){
                if(sum(pblm) >= (n - i)/2){
                    # back to square 1
                    res = sample(x)
                } else {
                    # we pair each identical element with 
                    # an extra one
                    extra = sample(which(!pblm), sum(pblm))
                    id_reshuffle = c(which(pblm), extra)
                    res[id_reshuffle] = sample(res[id_reshuffle])
                }

                # We recheck from the beginning
                OK = FALSE
                break
            }
        }
    }

    res
}

Even though the while loop looks scary, in practice convergence is fast. Of course, the lower the probability to have two characters at min.dist away, the faster the convergence.

The current solutions by @Abdessabour Mtk and @Carles Sans Fuentes work but, depending on the size of the input data, quickly become prohibitively slow. Here's a benchmark:

library(microbenchmark)

x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520

microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#>                 expr       min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 1)    87.001   111.501    155.071   131.801   192.401    264.401    10
#>    shuffler_am(x, 1) 17218.100 18041.900  20324.301 18740.351 22296.301  26495.200    10
#>   shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701    10

microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#>                 expr     min        lq       mean    median        uq        max neval
#>  prob_shuffler(x, 2)   140.1   195.201   236.3312   245.252   263.202    354.101    10
#>    shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800  29133.400    10
#>   shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401    10

microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#>                 expr       min        lq        mean     median       uq        max neval
#>  prob_shuffler(x, 3)   318.001   450.402    631.5312    573.352    782.2   1070.401    10
#>    shuffler_am(x, 3) 19003.501 19622.300  23314.4808  20784.551  28281.5  32885.101    10
#>   shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901    10

We can remark two things: a) in all logic, the speed of prob_shuffler depends on min.dist while the other methods not so much, b) prob_shuffler is about 100-fold faster for just 520 observations (and it scales).

Of course if the probability to have two identical characters at min.dist away is extremely high, then the recursive methods should be faster. But in most practical cases, the probability method is faster.

Laurent Bergé
  • 1,292
  • 6
  • 8
  • 1
    This is a really interesting approach! Perhaps a `max.iter` or similar argument would be useful though, in case the input is something where it's not possible to achieve convergence. – Mikko Marttila Nov 26 '20 at 09:04
  • @LaurentBergé Thanks a lot for this alternative method! In fact, I noticed that the other approaches took some time when I increased the min.dist even though my data is not that large yet. So I think your approah is awesome for everybody with larger data. – Joachim Schork Nov 26 '20 at 09:22
  • @MikkoMarttila: sure, and the algorithm can be improved, but it was just to give a quick and dirty alternative. – Laurent Bergé Nov 26 '20 at 13:01
3

I hope this answer works fine for you. It is done with base R, but it works. I leave the printing if you want to check line by line:

x <- rep(LETTERS[1:5], 3)  # Create example vector


shuffle <- function(x, min_dist=3){
  #init variables   
  result<-c() # result vector
  count<-0
  vec_use<-x
  vec_keep<-c()
  for(i in 1:length(x)){
#    print(paste0("iteration =", i))
    if (count>min_dist){
      valback<-vec_keep[1]
#      print(paste0("value to be returned:",  valback))
      ntimes_valback<-(table(vec_keep)[valback])
      vec_use<- c(vec_use,rep(valback,ntimes_valback))
#      print(paste0("vec_use after giving back valbak =", valback))
#      print(paste0(vec_use,","))
      vec_keep <- vec_keep[!vec_keep %in% valback]
#      print(paste0("vec_keep after removing valback =", valback))
#      print(paste0(vec_keep,","))
    }
    val<-sample(vec_use,1)
#    print(paste0("val = ",val))#remove value
    vec_keep<- c(vec_keep,x[x %in% val])
    vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
#    print(paste0("vec_keep ="))
#    print(paste0(vec_keep,","))
    vec_use <- vec_use[!vec_use %in% val]
#    print(paste0("vec_use ="))
#    print(paste0(vec_use,","))
    result[i]<-val
    count<-count+1
    }
  return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"
Carles
  • 2,731
  • 14
  • 25
  • Thank you so much for your code. It works perfectly fine with my data! I gave the "accepted answer" to Abdessabour Mtk, since he answered first. But your solution works as well. – Joachim Schork Nov 25 '20 at 10:06
  • 1
    No worries !, i Think this one was clearer on the logic (but happy to have helped !) – Carles Nov 25 '20 at 13:26