I have the following:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')

I would like to find the p vectors whose union covers as many components of dist as possible. In this case the solution would be p1, p3, p5. I want to use the minimal number of p vectors. In addition, if there is no way to cover all the components of dist, I want to choose the maximal cover.

Sotos
Avi
  • Why only `p1, p3, p5`? Isn't `p2, p3, p5` also the same? – Sotos Jun 18 '17 at 10:22
  • Thanks @Sotos, you are right. In this case since p1 and p2 have the same number of attributes it can be also a solution. For me one solution is good enough (I don't have to get all of them) only the first ones who meet the constraints. – Avi Jun 18 '17 at 10:48

1 Answer

Here is my attempted solution. I've tried to vectorize/matricize it as much as I can, so I hope it's fast enough. Each step is explained in the comments.

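library(qdapTools)
library(dplyr)
library(data.table)
## generate matrix of attributes: one row per p, one column per attribute
## (assuming qdapTools::mtabulate, which tabulates each vector in the list)
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5))
attribute_matrix
##   att1 att2 att3 att4 att5 att6
## 1    1    1    0    0    1    0
## 2    1    0    0    1    1    0
## 3    0    1    1    1    0    0
## 4    1    1    1    0    0    0
## 5    0    0    0    0    0    1

## convert the data frame to a numeric matrix so that it works with %*%
attribute_matrix <- as.matrix(attribute_matrix)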

## create a grid of all 2^5 combinations of p (one row per combination, 1 = p is included)
grid_matrix <- do.call(CJ, rep(list(1:0), 5))  %>% as.matrix
colnames(grid_matrix) <- paste0("p", 1:5)

## check whether each combination covers every attribute
combin_all_element_present <- rowSums(grid_matrix %*% attribute_matrix > 0) %>% 
  `==`(., ncol(attribute_matrix))

combin_all_element_present
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
## [23] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE

## keep only the combinations that satisfy the condition
grid_matrix_sub <- grid_matrix[combin_all_element_present, ]
## find the combinations with the minimum number of p
grid_matrix_sub[rowSums(grid_matrix_sub) == min(rowSums(grid_matrix_sub)), ]
##      p1 p2 p3 p4 p5
## [1,]  0  1  0  1  1
## [2,]  0  1  1  0  1
## [3,]  1  0  1  0  1

Note

In case you want to use quanteda, you can generate attribute_matrix with

library(quanteda)
attribute_matrix <- lapply(list(p1, p2, p3, p4, p5), function(x) paste(x, collapse = ' ')) %>% 
  unlist %>% tokens %>% dfm %>% as.matrix
attribute_matrix
##        features
## docs    att1 att5 att2 att4 att3 att6
##   text1    1    1    1    0    0    0
##   text2    1    1    0    1    0    0
##   text3    0    0    1    1    1    0
##   text4    1    0    1    0    1    0
##   text5    0    0    0    0    0    1
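
If neither package is available, the same 0/1 matrix can probably be built in base R alone. This is a sketch rather than part of the original answer, and the name `p_list` is hypothetical.

```r
# Data from the question
dist <- c('att1', 'att2', 'att3', 'att4', 'att5', 'att6')
p_list <- list(p1 = c('att1', 'att5', 'att2'),
               p2 = c('att5', 'att1', 'att4'),
               p3 = c('att3', 'att4', 'att2'),
               p4 = c('att1', 'att2', 'att3'),
               p5 = c('att6'))

# One row per p, one column per element of dist; 1 if the attribute is present
attribute_matrix <- t(sapply(p_list, function(p) as.integer(dist %in% p)))
colnames(attribute_matrix) <- dist
attribute_matrix
```

Because the columns follow the order of `dist`, attributes that appear in no p still get an all-zero column, which keeps the coverage check honest.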
amatsuo_net
  • Thanks I get the following error:> attribute_matrix <- lapply(list(p1, p2, p3, p4, p5), function(x) paste(x, collapse = ' ')) %>% + unlist %>% tokens %>% dfm %>% as.matrix Error in validObject(.Object) : invalid class “dfmSparse” object: superclass "replValueSp" not defined in the environment of the object's class – Avi Jun 18 '17 at 11:13
  • Hmm. It should be a version issue of `quanteda`. What's your version? – amatsuo_net Jun 18 '17 at 11:18
  • > packageVersion("quanteda") [1] ‘0.9.9.65’ . I installed the Matrix package and now my error is Error in validObject(.Object) : invalid class “dfmSparse” object: superclass "replValueSp" not defined in the environment of the object's class – Avi Jun 18 '17 at 11:20
  • Could you restart session, reinstall `quanteda`, and run the code? It seems a known issue for quanteda... https://stackoverflow.com/questions/42025827/quanteda-invalid-class-dfmsparse-object – amatsuo_net Jun 18 '17 at 11:23
  • Done and still does not work, same error. Is there an alternative to the line? – Avi Jun 18 '17 at 11:28
  • I will think about it – amatsuo_net Jun 18 '17 at 11:29
  • See my update. I am not really familiar with `qdapTools` but the output is the same. – amatsuo_net Jun 18 '17 at 11:43
  • Thanks it works! So practically I can choose grid_matrix_sub first row if I want only the first solution? – Avi Jun 18 '17 at 11:48
  • Yes. That's right. You can pick any one row from the outcome matrix. – amatsuo_net Jun 18 '17 at 11:49
  • `data.table::CJ` perhaps but this operation is always an expensive one. – David Arenburg Jun 18 '17 at 11:58
  • Thanks. It's good to know. Is this efficient? `do.call(CJ, list(0:1, 0:1, 0:1, 0:1, 0:1))` – amatsuo_net Jun 18 '17 at 12:05
  • I get an error: > combin_all_element_present <- rowSums(grid_matrix %*% attribute_matrix > 0) %>% + `==`(., ncol(attribute_matrix)) Error in grid_matrix %*% attribute_matrix : requires numeric/complex matrix/vector arguments – Avi Jun 18 '17 at 12:54
  • I get the following error: grid_matrix <- do.call(CJ, rep(list(1:0), N)) %>% as.matrix Error: cannot allocate vector of size 8.0 Gb I work with N=32 and can be much bigger. How can I work with this big data? – Avi Jun 19 '17 at 07:06
  • The first thing I would try is to cap the row sums of the matrix each time `CJ` is called, using the best solution size known so far. For example, if you know that there is a combination of 10 `p`s that achieves the maximum coverage, all you need is a matrix whose row sums are at most 9. If this still doesn't work, another possibility is to create a matrix with constant row sums and increment that number. For instance, create a matrix in which all row sums == 2. If no combination in this matrix works, increment the row sum until you find the answer. – amatsuo_net Jun 19 '17 at 07:54
  • Thanks a lot. Can you please show an example of how to do it? Since my number of p can be very big (1024)? – Avi Jun 19 '17 at 08:51
  • If the number of p can be that big, there has to be very different solution for the problem. For N = 1024, the combination you can explore with one matrix is 3, and I assume that cannot be a solution. I would recommend opening a new question stating that you need a large data solution. If you still need smaller combination solution, I can work on it. – amatsuo_net Jun 19 '17 at 09:44
  • Thanks a lot @amatsuo_net, I would like to ask one more little question. Can I open a discussion in chat? – Avi Jun 19 '17 at 11:20
  • Yes. You can convert this entire discussion to chat. – amatsuo_net Jun 19 '17 at 14:20
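
The "constant row sums, then increment" idea from the comments could be sketched as below. It uses base R `combn` instead of `CJ`, so only the combinations of one size k exist in memory at a time. The names `p_list` and `smallest_cover` are hypothetical, and the number of combinations still grows quickly with N, so this is not expected to scale to N = 1024.

```r
# Data from the question
dist <- c('att1', 'att2', 'att3', 'att4', 'att5', 'att6')
p_list <- list(p1 = c('att1', 'att5', 'att2'),
               p2 = c('att5', 'att1', 'att4'),
               p3 = c('att3', 'att4', 'att2'),
               p4 = c('att1', 'att2', 'att3'),
               p5 = c('att6'))
attribute_matrix <- t(sapply(p_list, function(p) as.integer(dist %in% p)))

# Hypothetical helper: try combinations of size 1, 2, ... and stop at the first
# one that reaches the best coverage achievable with all p together
smallest_cover <- function(attribute_matrix, max_size = nrow(attribute_matrix)) {
  target <- sum(colSums(attribute_matrix) > 0)  # attributes present in at least one p
  for (k in seq_len(max_size)) {
    combos <- combn(nrow(attribute_matrix), k)  # each column holds k row indices
    covered <- apply(combos, 2, function(idx) {
      sum(colSums(attribute_matrix[idx, , drop = FALSE]) > 0)
    })
    hit <- which(covered == target)
    if (length(hit) > 0) {
      return(rownames(attribute_matrix)[combos[, hit[1]]])
    }
  }
  NULL  # no combination of size <= max_size reaches the target
}

first_solution <- smallest_cover(attribute_matrix)
first_solution
```

Since the target is the coverage of all p taken together, the function returns a maximal cover even when some component of dist appears in no p. For very large N an approximate method, for example a greedy set-cover heuristic, is the usual alternative to exhaustive search.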