Is there a way to shuffle a data frame's rows subject to a condition? For instance, I have this data frame:

data = data.frame(
  id = 3:26,
  name = rep(c("restructuring", "control", "clitic filler", "action filler"),
             times = c(4, 4, 8, 8))
)

Here ids 3-6 are 'restructuring', 7-10 are 'control', 11-18 are 'clitic filler', and 19-26 are 'action filler'. I'd like to shuffle the rows so that the name column never has the same value in two consecutive rows.

I tried:

shuffled_data= data[sample(1:nrow(data)), ]

But this obviously shuffles the rows without applying any constraint.

Tom
  • I'm also doubtful there's a nice ready-made answer. With different numbers of rows for each condition, a solution isn't guaranteed to exist: if, for example, your input was 3 "restructuring" and 1 "control", it would be impossible. If this example is representative of the size of your real data, you may just want to generate 1000 or 10000 shuffles and throw out any that don't meet your criteria. If your actual data is much larger, that could be impractical, and you might need to be more deterministic and less random. – Gregor Thomas Mar 06 '23 at 19:30
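
The solvability point in the comment above can be checked before shuffling. The sketch below is not from the thread; `can_avoid_repeats` is a hypothetical helper name. It assumes a non-empty label vector and relies on the fact that the most frequent label, with count k, needs at least k - 1 rows of other labels to separate its occurrences.

```r
# Hypothetical helper (not part of the original thread): returns TRUE when the
# labels can be ordered so that no two neighbouring rows share a label.
# Assumes `labels` is non-empty.
can_avoid_repeats <- function(labels) {
  counts <- table(labels)
  # The most frequent label needs (max count - 1) separators drawn from
  # the remaining rows.
  max(counts) <= length(labels) - max(counts) + 1
}

# The impossible case from the comment: 3 "restructuring" and 1 "control".
can_avoid_repeats(c("restructuring", "restructuring", "restructuring", "control"))
# [1] FALSE

# The group sizes from the question (4, 4, 8, 8).
can_avoid_repeats(rep(c("restructuring", "control", "clitic filler", "action filler"),
                      times = c(4, 4, 8, 8)))
# [1] TRUE
```

A TRUE result only says that a valid ordering exists; it says nothing about how often a purely random shuffle will hit one.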

2 Answers

If your data is about this size, I would do a bunch of random shuffles and find one(s) that meet your criteria:

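# Return the rows of `data` in a uniformly random order
shuffle = function(data) {
  data[sample(1:nrow(data)), ]
}

# TRUE if no two consecutive rows share the same `name`
check = function(data) {
  all(data$name[-1] != data$name[-nrow(data)])
}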

set.seed(47)
results = replicate(10000, shuffle(data), simplify = FALSE)
results = results[sapply(results, check)]
length(results)
# [1] 10
## 10 of the 10000 shuffles meet your criteria

## here's one:
results[[1]]
#    id          name
# 16 18 clitic filler
# 21 23 action filler
# 9  11 clitic filler
# 20 22 action filler
# 15 17 clitic filler
# 24 26 action filler
# 1   3 restructuring
# 13 15 clitic filler
# 7   9       control
# 2   4 restructuring
# 19 21 action filler
# 6   8       control
# 4   6 restructuring
# 23 25 action filler
# 3   5 restructuring
# 22 24 action filler
# 10 12 clitic filler
# 18 20 action filler
# 12 14 clitic filler
# 5   7       control
# 11 13 clitic filler
# 8  10       control
# 17 19 action filler
# 14 16 clitic filler
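
If only one valid ordering is needed, the same shuffle-and-check idea can stop at the first success instead of storing all 10000 shuffles. This is a sketch of that variation, not part of the answer above; `shuffle_until_valid`, `col`, and `max_tries` are names introduced here for illustration, and the input is assumed to be a plain data frame.

```r
# Hypothetical variation: shuffle until the first ordering with no equal
# neighbours in `col`, giving up after `max_tries` attempts.
shuffle_until_valid <- function(data, col = "name", max_tries = 1e5) {
  x <- data[[col]]
  for (i in seq_len(max_tries)) {
    idx <- sample.int(nrow(data))   # random permutation of the row indices
    y <- x[idx]
    # Compare each value with its predecessor before touching the data frame
    if (all(y[-1] != y[-length(y)])) {
      return(data[idx, , drop = FALSE])
    }
  }
  stop("no valid shuffle found in ", max_tries, " tries")
}

data <- data.frame(
  id = 3:26,
  name = rep(c("restructuring", "control", "clitic filler", "action filler"),
             times = c(4, 4, 8, 8))
)

set.seed(47)
one_result <- shuffle_until_valid(data)
```

Because each attempt is an independent uniform shuffle, the accepted result is still uniformly distributed over the valid orderings, as with filtering the full list.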
Gregor Thomas

Using the prob_shuffler() function from this answer with min.dist = 1 (the function itself is defined in the linked answer, not here):

library(data.table)

setorder(setDT(data), name)[
  frank(prob_shuffler(cumsum(!duplicated(name)), 1L), ties.method = "random")
]
#>     id          name
#>  1:  3 restructuring
#>  2: 10       control
#>  3: 17 clitic filler
#>  4:  7       control
#>  5:  5 restructuring
#>  6: 13 clitic filler
#>  7: 20 action filler
#>  8: 11 clitic filler
#>  9:  9       control
#> 10: 24 action filler
#> 11: 16 clitic filler
#> 12: 25 action filler
#> 13: 14 clitic filler
#> 14: 19 action filler
#> 15:  4 restructuring
#> 16: 18 clitic filler
#> 17: 22 action filler
#> 18:  6 restructuring
#> 19: 23 action filler
#> 20: 15 clitic filler
#> 21:  8       control
#> 22: 21 action filler
#> 23: 12 clitic filler
#> 24: 26 action filler
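
Since `prob_shuffler()` is not defined in this thread, here is a self-contained constructive sketch in base R. It is a different technique from the one in the linked answer: a greedy pass that picks a random group other than the previous one at each position, and is forced to pick a group whenever that group holds more than half of the remaining rows. `shuffle_no_repeats` is a hypothetical name, the input is assumed to be a plain, non-empty data frame, and the result is random but not uniformly distributed over all valid orderings.

```r
# Hypothetical greedy sketch (not the linked prob_shuffler): builds one
# ordering with no equal neighbours in `col`, without rejection sampling.
shuffle_no_repeats <- function(data, col = "name") {
  n <- nrow(data)
  # Row indices per group, shuffled within each group
  groups <- lapply(split(seq_len(n), data[[col]]),
                   function(i) i[sample.int(length(i))])
  counts <- lengths(groups)
  if (2 * max(counts) > n + 1) stop("no valid arrangement exists")

  out <- integer(n)
  prev <- NA_character_
  for (pos in seq_len(n)) {
    m <- n - pos + 1                           # rows still to place
    forced <- names(counts)[2 * counts > m]    # group that must go now, if any
    if (length(forced) == 1) {
      g <- forced
    } else {
      cand <- names(counts)[counts > 0 & !(names(counts) %in% prev)]
      # Weight by remaining size so large groups are used up steadily
      g <- cand[sample.int(length(cand), 1, prob = counts[cand])]
    }
    out[pos] <- groups[[g]][counts[[g]]]       # take the last unused row of g
    counts[[g]] <- counts[[g]] - 1L
    prev <- g
  }
  data[out, , drop = FALSE]
}

data <- data.frame(
  id = 3:26,
  name = rep(c("restructuring", "control", "clitic filler", "action filler"),
             times = c(4, 4, 8, 8))
)

set.seed(1)
greedy_result <- shuffle_no_repeats(data)
```

The forced-pick rule keeps every step solvable: once no group exceeds half of the remaining rows and the previous group holds at most half, some other group always has a row left to place.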
jblood94