
Here is my problem: I would like to generate a fairly large number of factorial combinations and then apply some constraints to narrow down the list of all possible combinations. However, this becomes an issue when the number of possible combinations gets extremely large. As an example, assume we have 8 variables (A, B, C, etc.), each taking 3 levels/values (A={1,2,3}; B={1,2,3}; etc.). The list of all possible combinations has 3^8 (= 6561) entries and can be generated as follows:

tic <- function(){start.time <<- Sys.time()}
toc <- function(){round(Sys.time() - start.time, 4)}

nX = 8

tic()
lk = as.list(NULL)
lk = lapply(1:nX, function(x) c(1,2,3))
toc()

tic()
mapx = expand.grid(lk)
mapx$idx = 1:nrow(mapx)
toc()

So far so good, these operations are done pretty quickly (< 1 second) even if we significantly increase the number of variables.

The next step is to generate the corrected set of all pairwise comparisons. (An uncorrected set would be obtained by freely combining all 6561 options with each other, leading to 6561 × 6561 = 43046721 combinations.) The size of this corrected "universe" is 6561 × (6561 - 1) / 2 = 21520080. Already pretty big!

I am using the built-in R function combn to get it done. In this example the running time remains acceptable (about 20 seconds on my PC), but things become impossible with a higher number of variables and/or more levels per variable: the running time increases exponentially (for example, it already took 177 seconds with 9 variables!). My biggest concern, though, is that the object would become so large that R can no longer handle it (memory issue).

tic()
univ = t(combn(mapx$idx,2))
toc()
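
To put numbers on the memory concern, here is a rough sketch of how the universe grows. `universe_size` is a hypothetical helper, and the size estimate assumes each pair is stored as two 4-byte integer indices, which is an assumption about how the result of combn is held in memory.

```r
# Hypothetical helper: size of the pairwise "universe" for a full factorial
# design with n_vars variables and n_levels levels per variable.
universe_size <- function(n_vars, n_levels = 3) {
  n_options <- n_levels^n_vars          # rows of the full factorial grid
  n_pairs   <- choose(n_options, 2)     # unordered pairs of distinct rows
  data.frame(
    n_vars    = n_vars,
    n_options = n_options,
    n_pairs   = n_pairs,
    approx_gb = n_pairs * 2 * 4 / 1e9   # assumption: two 4-byte integers per pair
  )
}

# Compare 8, 9 and 10 variables with 3 levels each.
sizes <- do.call(rbind, lapply(8:10, universe_size))
print(sizes)
```

With 3 levels, each extra variable multiplies the number of pairs by roughly 9, so the 8-variable case (about 0.17 GB under the assumption above) grows into several gigabytes within two more variables.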

The next step is to identify the combinations meeting some predefined constraints. For instance, I would like to select all pairs sharing exactly 3 common elements (i.e. 3 variables take the same values). Again, the running time is very long (even with 8 variables), as my approach is to loop over all the combinations previously defined.

tic()
vrf = NULL
vrf = sapply(1:nrow(univ), function(x){
  j1 = mapx[mapx$idx==univ[x,1],-ncol(mapx)]
  j2 = mapx[mapx$idx==univ[x,2],-ncol(mapx)]
  cond = ifelse(sum(j1==j2)==3,1,0)
  return(cond)})
toc()

tic()
univ = univ[vrf==1,]
toc()
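
A possible way to speed up the check above, as a sketch only: most of the time in the sapply loop is likely spent on the two data frame lookups per pair, so the comparison can instead be done one variable at a time on a plain matrix. `count_shared` is a hypothetical helper, and the demo uses a smaller design (4 variables, exactly 2 shared values) so that it runs instantly.

```r
# Hypothetical helper: for every row pair, count how many variables agree.
# grid  : matrix with one row per combination (no idx column)
# pairs : two-column matrix of row indices into grid
count_shared <- function(grid, pairs) {
  n_shared <- integer(nrow(pairs))
  for (k in seq_len(ncol(grid))) {
    # Compare variable k for all pairs at once and accumulate the matches.
    n_shared <- n_shared + (grid[pairs[, 1], k] == grid[pairs[, 2], k])
  }
  n_shared
}

# Small demo: 4 variables with 3 levels each, keep pairs sharing exactly 2 values.
grid  <- as.matrix(expand.grid(rep(list(1:3), 4)))
pairs <- t(combn(nrow(grid), 2))
keep  <- pairs[count_shared(grid, pairs) == 2, , drop = FALSE]
nrow(keep)
```

This still needs the full pair matrix in memory, so it addresses the running time of the filter but not the size of the universe itself.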

Would you know how to overcome this issue? Any tips or advice would be more than welcome!

Nicolas K
  • Generating all combinations only to discard most of them seems wasteful. I would, as a first step, write an Rcpp function that only stores the combinations that fulfill the conditions. However, ultimately, the memory and performance issue might not be solvable. You should consider whether you really need to create these combinations or whether there is a better approach towards your actual goal. – Roland Dec 20 '21 at 11:35
  • Many thanks Roland for your prompt answer! Indeed, generating the universe of combinations only to select a few of them is a waste of resources. I've never done Rcpp, but it might be a nice thing to learn over xmas :-) Alternatively, would you know a nice algorithm to sample from the list of initial options (in my example, 6561) in a smart way? I guess "sample()" is not optimal? – Nicolas K Dec 20 '21 at 11:52
  • Depending on the nature of your constraints, there might already be a package providing what you need. I don't work with combinations and permutations but I somewhat remember that such a package exists. Unfortunately, I have forgotten what it was called. – Roland Dec 20 '21 at 12:04
  • Seems like you could get around the memory problem by generating subsets of the combination space, applying your constraint to each subset, and then discarding the subsets. This, however, will not solve the compute problem. For that, there is no way to avoid writing a more complicated algo that generates the constrained combinations directly, i.e. select all permutations of 3 vars that are the same and then add all possible combos of the rest of the vars. – Eric Dec 20 '21 at 12:07
  • Not sure if @Roland is thinking of a linear programming package like `lpSolve`. This will do a good job finding *a solution* for the constrained equation but not *all solutions*. – Eric Dec 20 '21 at 12:17
  • This seems like a job for `for` loops. You can generate each combination in turn. If it passes your filter, you can add it to your list. If that's not fast enough, try Rcpp. – dash2 Dec 20 '21 at 13:00
  • It all depends on what you want to do with the results. If you are interested in some kind of integration (e.g. find the average value of some function of each of the combinations that meets your condition), then Monte Carlo approximations can work: sample uniformly from the full universe, reject the ones that don't meet the condition, average the values for the ones that do. You can choose the accuracy by the number `N` of samples you take. The main limitation is that computing time is linear in `N`, while the error only falls like `1/sqrt(N)`. – user2554330 Dec 20 '21 at 13:19
  • Others have suggested writing your own function to generate combinations/permutations one at a time and check the validity. There are a couple of packages that provide combinatorial iterators. One is `arrangements` the other is `RcppAlgos` (I am the author). See here for examples: https://cran.r-project.org/web/packages/RcppAlgos/vignettes/CombinatoricsIterators.html – Joseph Wood Dec 21 '21 at 14:41
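
The constructive idea raised in the comments (fix which variables agree, then vary the rest) could be sketched in base R as below. This is only an illustration under stated assumptions: `pairs_sharing_exactly` is a hypothetical helper, rows are indexed in expand.grid order (first variable varies fastest), and it assumes 0 <= n_shared < n_vars and n_levels >= 2. It builds only the pairs that satisfy the constraint, never the full universe.

```r
# Hypothetical helper: all unordered row pairs (i < j) of a full factorial
# grid that agree on exactly n_shared variables.
pairs_sharing_exactly <- function(n_vars, n_levels, n_shared) {
  grid   <- as.matrix(expand.grid(rep(list(seq_len(n_levels)), n_vars)))
  weight <- n_levels^(seq_len(n_vars) - 1)   # row index = 1 + sum((value - 1) * weight)
  n_diff <- n_vars - n_shared
  i      <- seq_len(nrow(grid))

  # Which variables differ (one set per column), and by how many levels each
  # differing variable is shifted (cyclically, never by zero).
  diff_sets <- combn(n_vars, n_diff)
  shifts    <- as.matrix(expand.grid(rep(list(seq_len(n_levels - 1)), n_diff)))

  out <- list()
  for (s in seq_len(ncol(diff_sets))) {
    vars <- diff_sets[, s]
    for (r in seq_len(nrow(shifts))) {
      partner <- grid
      shifted <- sweep(grid[, vars, drop = FALSE] - 1, 2, shifts[r, ], "+")
      partner[, vars] <- shifted %% n_levels + 1
      j    <- as.vector((partner - 1) %*% weight) + 1   # row index of each partner
      keep <- i < j                                     # store each pair once
      out[[length(out) + 1]] <- cbind(i[keep], j[keep])
    }
  }
  do.call(rbind, out)
}

# Small demo: 4 variables, 3 levels, exactly 2 shared values.
res <- pairs_sharing_exactly(4, 3, 2)
nrow(res)
```

For the case in the question (8 variables, 3 levels, exactly 3 shared values), counting gives choose(8, 3) × 3^8 × 2^5 / 2 = 5878656 qualifying pairs, so the result is still large, but it is about a quarter of the full universe and is produced without the pair-by-pair filter.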
