Minimal number of coverage of big data lists

Question

Following my question I use the following code:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
....
p32<-c('att35','att34','att32')

In the real case there can be 1024 vectors. I would like to find all the relevant p that the unification of them will be the maximal components of dist. I this case the solution would be p1, p3, p5. I want to choose the minimal number of p. In addition, in case there is no way to cover all the of dist component so I want to choose the maximal cover with minimal number of vectors (p).

N = 32
library(qdapTools)
library(dplyr)
library(data.table)
## generate matrix of attributes
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5,...,p32))

library (bigmemory)
## generate matrix of attributes
grid_matrix <- do.call(CJ, rep(list(1:0), N))  %>% as.big.matrix

Error: cannot allocate vector of size 8.0 Gb

I tried an alternative way for it:

grid_matrix <- do.call(CJ, rep(list(1:0), N))  %>% as.data.frame
grid_matrix <- as.matrix (grid_matrix)

And still got the same error.

How can I fix it and use it for big data? I wanted to continue with:

 colnames(grid_matrix) <- paste0("p", 1:N)
    combin_all_element_present <- rowSums(grid_matrix %*% attribute_matrix > 0) %>% `==`(., ncol(attribute_matrix))
    grid_matrix_sub <- grid_matrix[combin_all_element_present, ]
    grid_matrix_sub[rowSums(grid_matrix_sub) == min(rowSums(grid_matrix_sub)), ]

G. Grothendieck · Accepted Answer · 2017-06-21T10:50:54.333

2

This is known as a set covering problem. It can be solved using integer linear programming. Let x1, x2, ... be 0/1 variables (one for each p variable) and represent p1, p2, ... as 0/1 vectors P1, P2, ... and dist as a 0/1 vector D. Then the problem can be stated as:

min x1 + x2 + ... + x32
such that
P1 * x1 + P2 + x2 + ... + P32 * x32 >= D

which in R code is the following. First create a list p with the p vectors in sorted order. Use mixedsort so that p32 comes at the end instead of rigth after p3. Define attnames as the set of all att names in all the p vectors. Then formulate the objective function (which equals the number of p's in the cover), the constraint matrix (consisting of the P vectors as columns) and the right hand side of the constraint equations (which is dist as a 0/1 vector). Finally run the integer linear program and convert the solution from a 0/1 vector to a vector of p names.

library(gtools)
library(lpSolve)

p <- mget(mixedsort(ls(pattern = "^p\\d+$")))
attnames <- mixedsort(unique(unlist(p)))
objective <- rep(1L, length(p))
const.mat <- sapply(p, function(x) attnames %in% x) + 0L
const.rhs <- (attnames %in% dist) + 0L

ans <- lp("min", objective, const.mat, ">=", const.rhs, all.bin = TRUE)
names(p)[ans$solution == 1L]
## [1] "p2" "p4" "p5"

The constraint matrix has a row for each attnames entry and a column for each p vector.

The solution produces the minimal covers of those attnames elements that are in dist. If every element of dist appears in at least one p vector then the solution will represent a cover of dist. If not, the solution will represent a cover of those att names in one or more p vectors that are also in dist; thus, this handles both cases discussed in the question. The uncovered elements of dist are:

setdiff(dist, attnames)

so if that is of zero length then the solution represents a complete cover of dist. If not the solution represents a cover of

intersect(dist, attnames)

The sorting done in the code is not stricly needed but it may be easier to work with the various inputs to the optimization by having the rows and columns of the constraint matrix in a logical order.

Note: Run this code from the question before running the above code:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
p32<-c('att35','att34','att32')

edited Jun 21 '17 at 10:50

answered Jun 19 '17 at 14:25

G. Grothendieck

254,981
17
203
341

Thanks a lot @G. Grothendieck! I would like to ask: (1) Does using the mixedsort mean that each vector number will be after the vector number before (i.e. p1,p2,p3,p4...,p31,p32)? (2) If there is no optimal solution, I would like to get the maximal cover, but not the full cover does the code support it? and finally, is it the most efficient way to solve the problem? – Avi Jun 19 '17 at 14:39
Thanks again! I get an error: > const.mat <- sapply(p, function(x) attnames %in% x) + 0L Error in attnames %in% x : object 'attnames' not found In addition if I have a list called nam which all the vectors are in there (from 1 to 32 i.e. nam[[1]]..nam[[32]]) Do I have to change: p <- mget(mixedsort(ls(nam))? Rerun with Debug Error in attnames %in% x : object 'attnames' not found Rerun with Debug Error in attnames %in% x : object 'attnames' not found – Avi Jun 19 '17 at 14:48
If I have instead of p a list that contains all the p i.e. list called nam that nam[[1]] is p1... nam[[32]] is p32 do I have to change the line into p <- mget(mixedsort(ls(nam)))? – Avi Jun 19 '17 at 14:50
When I enter the value of list in p such as: > head (nam) [[1]] [1] "V2" "V13" "V6" "V12" "V17" "V4" "V5" "V16" "V14" "V11" "V3" "V7" "V9" [[2]] [1] "V9" "V14" "V10" "V13" "V12" "V16" "V4" "V2" "V7" "V5" "V11" "V6" [[3]] [1] "V7" "V9" "V2" "V5" "V12" "V14" "V17" "V15" "V8" "V6" "V16" "V10" "V3" "V4" "V13" [[4]] [1] "V10" "V3" "V11" "V4" "V13" "V8" "V2" "V17" "V14" "V16" "V6" "V5" "V12" "V9" [[5]] [1] "V16" "V3" "V14" "V12" "V2" "V6" "V13" "V15" "V10" "V9" "V5" "V8" I get > names(p)[ans$solution == 1L] character(0) – Avi Jun 19 '17 at 20:51
@ G. Grothendieck, I have a list called 'nam' (it is same as p). What do I have to insert to 'p' or to the code to work on the contents of 'nam' > str (nam) List of 31 $ : chr [1:13] "V2" "V13" "V6" "V12" ... $ : chr [1:12] "V9" "V14" "V10" "V13" ... $ : chr [1:15] "V7" "V9" "V2" "V5" ... $ : chr [1:14] "V10" "V3" "V11" "V4" ... $ : chr [1:12] "V16" "V3" "V14" "V12" .. – Avi Jun 20 '17 at 16:35
1

It's likely you just have to set p <- nam instead of the p <- ... line in the answer. Of course dist would have to be defined in terms of V instead of att to be consistent. – G. Grothendieck Jun 20 '17 at 17:20
I get > names(p)[ans$solution == 1L] [1] "" where I used dist<-colnames(....) and p <- nam instead of p <- mget(mixedsort(ls(pattern = "^p\\d+$"))). You can see that > dist [1] "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15" "V16" "V17" > str (dist) chr [1:16] "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12" "V13" "V14" "V15" ... > class (dist) [1] "character" and nam is as described above, Why is it? – Avi Jun 20 '17 at 17:58
I can't really read this. If you can provide a reproducible example legibly formatted that I can copy and paste into R then I might be able to respond. – G. Grothendieck Jun 20 '17 at 18:03
In pleasure. Can I send you via email or chat? – Avi Jun 20 '17 at 18:05
None of your attnames are in dist so the correct solution as a 0/1 vector is all zeros which is what you got. – G. Grothendieck Jun 20 '17 at 20:06
The solution down here works for it. I saw difference in the names of the matrix. The nam does not have names for each vector while p (in the source example) does have for each row (vector). Might this be the reason? – Avi Jun 20 '17 at 20:13
1

Your `p` has no names but you are referring to `names(p)` in the last line of code. Use the 0/1 vector `ans$solution` as your solution. – G. Grothendieck Jun 20 '17 at 20:23

User2321 · Answer 2 · 2017-06-20T07:28:58.300

The answer already provided is perfect but another approach could be the following:

dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')


library(qdapTools)
library(data.table)
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5))


minimal_sets <- function(superset, subsets_matrix, p){

  setDT(subsets_matrix)
  # removing the columns that are not in the superset
  updated_sub_matr <- subsets_matrix[, which(names(subsets_matrix) %in% superset), with = F]

  # initializing counter for iterations and the subset selected 
  subset_selected <- integer(0)
  counter <- p

  ## Loop until either we ran out of iterations counter = 0 or we found the solution
  while (counter > 0 & length(superset) > 0){

    ## find the row with the most matches with the superset we want to achieve  
    max_index <- which.max(rowSums(updated_sub_matr))

    ## remove from the superset the entries that match that line and from the subsets_matrix those columns as they dont contribute anymore
    superset <- superset[which(updated_sub_matr[max_index, ] == 0)]
    updated_sub_matr <- updated_sub_matr[, - which(updated_sub_matr[max_index, ] != 0), with = F]

    counter <- counter - 1
    subset_selected <- c(subset_selected, max_index)
  }

  if (length(superset) > 0){
    print(paste0("No solution found, there are(is) ", length(superset), " element(s) left ", paste(superset, collapse = "-")))            
  } else {            
    print(paste0("Found a solution after ", p - counter, " iterations"))           
  }

  print(paste0("Selected the following subsets: ", paste(subset_selected, collapse = "-")))

}

In this function you input your superset (in this case dist), the attribute_matrix and the number p which you want to check and it outputs the best possible solution it found as well as the iterations.

> minimal_sets(dist, attribute_matrix, 1)
[1] "No solution found, there are(is) 3 element(s) left att3-att4-att6"
[1] "Selected the following subsets: 1"

> minimal_sets(dist, attribute_matrix, 3)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5"

> minimal_sets(dist, attribute_matrix, 5)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5

Minimal number of coverage of big data lists

2 Answers2