Faster way to achieve filtering for all possible combinations

Question

Consider I have a data frame like this,

set.seed(1)

q<-100

df <- data.frame(Var1 = round(runif(q,1,50)),
                    Var2 = round(runif(q,1,50)),
                        Var3 = round(runif(q,1,50)),
                            Var4 = round(runif(q,1,50)))
attach(df)

As you realized, q is standing for setting the length of the each columns in the dataframe.

I want to make a filtering of all possible combinations of the columns. It can be anything. Let's say I am seeking for if the devision of the sums of the first two columns and the sums of the last two columns greater than 1 or not.

One thing to achieve that, using expand.grid() function.

a <- Sys.time()

expanded <- expand.grid(Var1, Var2, Var3, Var4)

Sys.time() - a

Time difference of 8.31997 secs


expanded  <- expanded[rowSums(expanded[,1:2])/ rowSums(expanded[,3:4])>1,]

However it takes a lot time! To make it faster, I tried to follow the answer with rep.int() function in this question and designed my own function.

myexpand <- function(...) {

 sapply(list(...),function(y) rep.int(y, prod(lengths(list(...)))/length(y)))

}

But it is not so promising again. It takes more time comparing to my expectation and the expand.grid also.And, If I set a greater q, it becomes a nigthmare!

Is there a proper way to achieve this a lot faster (1-2 seconds) with maybe matrix operations before applying expand.grid or myexpand . And, I wonder if it is a weakness of using an interpreted language like R... Software suggestions are also acceptable.

Maybe try parallelization...? or maybe an `Rcpp` implementation? — Sotos, Aug 12 '19 at 08:25
@Sotos I have no idea about parallelization and how to apply it in R. But I will have a look at. — maydin, Aug 12 '19 at 08:29
Here is a [link](https://stackoverflow.com/q/31254476/5635580) to get you started — Sotos, Aug 12 '19 at 08:44
you might want to check out https://cran.r-project.org/web/packages/RcppAlgos/index.html — chinsoon12, Aug 13 '19 at 05:24

chinsoon12 · Answer 1 · 2019-09-17T01:00:05.903

3

For this particular condition (i.e. ratio of sums > 1), you might want to consider using the data.table package:

system.time({
    #generate permutations of Var1 & Var2 and Var3 & Var4
    DT12 <- DT[, CJ(Var1=Var1, Var2=Var2, unique=TRUE)][, s12 := Var1 + Var2]
    DT34 <- DT[, CJ(Var3=Var3, Var4=Var4, unique=TRUE)][, s34 := Var3 + Var4]

    #perform a non-equi join
    DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE,
        .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4)][, s12:=NULL]
})

timing:

   user  system elapsed 
   0.02    0.06    0.08

output:

         Var1 Var2 Var3 Var4
      1:    2    5    2    4
      2:    4    3    2    4
      3:    5    2    2    4
      4:    2    6    2    4
      5:    4    4    2    4
     ---                    
1753416:   50   49   49   48
1753417:   50   50   49   48
1753418:   50   49   49   49
1753419:   50   50   49   49
1753420:   50   50   49   50

data:

library(data.table)
set.seed(1)
q <- 100
DT <- data.table(Var1 = round(runif(q,1,50)),
    Var2 = round(runif(q,1,50)),
    Var3 = round(runif(q,1,50)),
    Var4 = round(runif(q,1,50)))

edit: For summing of positive numbers, you can prob use the following (caveat: it will not be faster than using a Rcpp approach).

system.time({
    S <- DT[, .(UB=90 - Var1, C1=Var1)]
    for (k in 2:4) {
        S <- DT[S, on=paste0("Var", k, "<UB"), allow.cartesian=TRUE,
            mget(c(names(S), paste0("x.Var", k)))]
        setnames(S, paste0("x.Var", k), paste0("C", k))
        S[, UB := UB - get(paste0("C",k))]
    }
    S[, UB := NULL][rowSums(S)>30L]
})

timing:

   user  system elapsed 
   3.48    4.06    3.51

output, S:

> S
          C1 C2 C3 C4
       1: 14 33 14  6
       2: 14 33 14 25
       3: 14 33 14 24
       4: 14 33 14 19
       5: 14 33 14 10
      ---            
34914725: 31 39  3  8
34914726: 31 39  3  8
34914727: 31 39  3  9
34914728: 31 39  3 16
34914729: 31 39  3  8

edited Sep 17 '19 at 01:00

answered Aug 13 '19 at 05:29

chinsoon12

25,005
4
25
35

I got the error `object 'x.Var1' not found`. Additionally, if I run `nrow(expanded)` after filtering by `>1` or not, I get **55615910** but you found **1753420**.And one thing more, the dataframe example is just a fake data created for make the question replicable.In reality, `expand.grid` function also works with different number of inputs. I.e. `expand.grid(1:100,1:5,1:1000)` . I don't really know if datatable is fine with list type inputs... – maydin Aug 13 '19 at 07:24
i am guessing that your solution has duplicates. `data.table::CJ` can take in any number of vectors as well – chinsoon12 Aug 13 '19 at 07:33
1

I don't understand but I still get `object 'x.Var1' not found` error. And one more thing just to make it easier, what if we make the filtering say total sum of all variables greater than 30 and less than 90. If I apply this, `DTnew <- DT[, CJ(Var1,Var2,Var3, Var4, unique=FALSE)][, Total := Var1+ Var2 +Var3 + Var4]` . It takes more time comparing to `expand.grid`. Sorry I am not so familiar with DT. – maydin Aug 13 '19 at 22:22
I will take a look at the error later. For sum, you can use Joseph Wood’s RcppAlgos package – chinsoon12 Aug 13 '19 at 22:24
1

I talked to Joseph Wood on this site under a [question](https://stackoverflow.com/questions/22569176/r-permutations-and-combinations-with-without-replacement-and-for-distinct-non-d/47983855?noredirect=1#comment101424480_47983855) he answered. And he said he was sorry about the Cartesian Product.Unfortunately `RcppAlgos` package does not have an option for it but it is something that he is working on...He also saw this question as he said. By the way, thank you for your help and sorry for the delay of the answer. – maydin Aug 13 '19 at 22:28
@maydin i am not getting the error. Did you use your own data.frame or my data.table data? if u are using the former, can you set it to data.table using `setDT(df)` – chinsoon12 Aug 14 '19 at 00:52
Sorry but I still get. And I used your DT input. It is happening `DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE, .(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4), by=.EACHI][, s12:=NULL]` part.Anyway, I will upvote your answer.. – maydin Aug 14 '19 at 09:23
can you share your data.table version? – chinsoon12 Aug 14 '19 at 10:05
`packageVersion("data.table")` gives `1.12.2` – maydin Aug 14 '19 at 10:09
so if u run the data code in a new R session and then the top code panel, you get the `object 'x.Var1' not found`? – chinsoon12 Aug 14 '19 at 10:10
2

woah thanks. learned about streamable.com today. let me chew over it some more. maybe there might be some seed that can trigger the error. can you print DT12 and DT34? i added something to specify the names explictly. strange error – chinsoon12 Aug 14 '19 at 10:24
@chinsoon12 why the ```.EACHI```? Since there is no aggregation, there's probably no memory saving using it, right...? – Cole Sep 07 '19 at 13:17
@Cole, thanks for that. Yeah not required. Not in front of computer to test though – chinsoon12 Sep 07 '19 at 21:59

Kelly · Answer 2 · 2019-08-12T16:34:30.283

0

How about using tidyr and dplyr?

library(tidyr)
library(dplyr)

expanded <- 
  df %>% 
  expand(Var1, Var2, Var3, Var4)
  mutate(res = (Var1 + Var2)/(Var3 + Var4)) %>%
  filter(res > 1)

Note that expand works slightly differently from expand.grid - it only keeps the unique rows - so this may not be what you need.

On my computer, this all takes less than a second.

edited Aug 12 '19 at 16:34

answered Aug 12 '19 at 14:59

Kelly

43
5

How did you calculate `expanded` ? It is the real problematic part. – maydin Aug 12 '19 at 15:19
I misunderstood your question. I thought it was just the calculation and filtering step that you wanted to speed up. I've now edited my response to include the expansion step. – Kelly Aug 12 '19 at 16:37
1

Thanks for your effort. I am really appreciated. However, this problem is not about using `base R` versus `dplyr`. It is much more complicated. And using pipe operator is not a solution. Because I am seeking for possible matrix operations before expanding which result in a **real fast** solution. Once you run the codes on your computer, you can realize the situation. If you follow my question step by step, and read the commands like parallelization suggestion, you will understand the real topic. If the `%>%` were the case, believe me, I could handle it without asking. Thank you very much. – maydin Aug 12 '19 at 16:47

Faster way to achieve filtering for all possible combinations

2 Answers2

Linked