I am analysing a number of sets with the VennDiagram package. It has worked just fine so far, but it only handles up to 5 sets, and it now turns out that I need to look at 6 or more.

Ideally, I'm looking for something that can do this (below) with 6 or more sets, but it doesn't necessarily need a plot function as long as the counts can be retrieved:

Venn diagram of 5 sets generated by the package VennDiagram

Any ideas of what I can do to add one or more sets to these five and still get the counts?

Thanks!

SiKiHe
  • How is your data represented? The items you look for in those sets, and the sets themselves? – Jakub P. Jul 24 '15 at 09:02
  • In this case, they're vectors of city names (compiled from data frames from sales databases). I'm looking at overlaps etc. to get an idea of the market coverage. From the picture above, it looks like almost everybody is trying to make sales in the same markets. – SiKiHe Jul 24 '15 at 09:09

3 Answers

Here is a recursive solution that finds all of the intersections in the Venn diagram. `sets` can be a list containing any number of sets to intersect. For some reason, the code in the package you are using is hard-coded for each set size, so it doesn't scale to an arbitrary number of sets.

## Build intersections, 'out' accumulates the result
intersects <- function(sets, out=NULL) {
    if (length(sets) < 2) return ( out )                               # return result
    len <- seq(length(sets))
    if (missing(out)) out <- list()                                    # initialize accumulator
    for (idx in split((inds <- combn(length(sets), 2)), col(inds))) {  # 2-way combinations
        ii <- len > idx[2] & !(len %in% idx)                           # indices to keep for next intersect
        out[[(n <- paste(names(sets[idx]), collapse="."))]] <- intersect(sets[[idx[1]]], sets[[idx[2]]])
        out <- intersects(append(out[n], sets[ii]), out=out)
    }
    out
}

The function builds pairwise intersections. To avoid building repeated solutions, it only calls itself on elements of `sets` with indices greater than those that were just joined (`ii` in the code). The result is a list of all the intersections. If you pass named components, the result is named by the convention "set1.set2" etc.

Results

## Some sample data (output below is from R < 3.6.0; newer versions
## draw different samples for the same seed)
set.seed(0)
sets <- setNames(lapply(1:3, function(.) sample(letters, 10)), letters[1:3])

## Manually check intersections
a.b <- intersect(sets[[1]], sets[[2]])
b.c <- intersect(sets[[2]], sets[[3]])
a.c <- intersect(sets[[1]], sets[[3]])
a.b.c <- intersect(a.b, sets[[3]])

## Compare
res <- intersects(sets)
all.equal(res[c("a.b","a.c","b.c","a.b.c")], list(a.b=a.b, a.c=a.c, b.c=b.c, a.b.c=a.b.c))
# TRUE

res
# $a.b
# [1] "g" "i" "n" "e" "r"
# 
# $a.b.c
# [1] "g"
# 
# $a.c
# [1] "x" "g"
# 
# $b.c
# [1] "f" "g"

## Get the counts of intersections
lengths(res)
# a.b a.b.c   a.c   b.c 
#   5     1     2     2 

Or, with numbers

intersects(list(a=1:10, b=c(1, 5, 10), c=9:20))
# $a.b
# [1]  1  5 10
# $a.b.c
# [1] 10
# $a.c
# [1]  9 10
# $b.c
# [1] 10
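
For the six-or-more case in the question, the same kind of intersections can also be enumerated without recursion. This is only a sketch: `all_intersections` and the six numeric sets are made up for illustration, and it assumes `sets` is a named list. As with the function above, each result holds the elements common to the named sets, regardless of whether they also occur elsewhere.

```r
## Hypothetical helper (not part of the answer above): intersect the sets
## for every combination of k >= 2 set names.
all_intersections <- function(sets) {
    stopifnot(!is.null(names(sets)), length(sets) >= 2)
    out <- list()
    for (k in 2:length(sets)) {
        for (nm in combn(names(sets), k, simplify = FALSE)) {
            out[[paste(nm, collapse = ".")]] <- Reduce(intersect, sets[nm])
        }
    }
    out
}

## Six made-up sets
sets <- list(s1 = 1:10, s2 = 5:15, s3 = 8:20,
             s4 = c(2, 9, 10, 30), s5 = c(9, 10, 11),
             s6 = seq(1, 19, by = 2))

counts <- lengths(all_intersections(sets))
length(counts)                      # 57 combinations of two or more sets
counts[["s1.s2"]]                   # 6
counts[["s1.s2.s3.s4.s5.s6"]]       # 1
```

The number of combinations grows as 2^n, so this is practical for a handful of sets but not for dozens.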
Rorschach

OK, here's one way, assuming you represent the sets as a list of vectors and the items to search for in those sets as a vector too:

# Example data format
sets <- list(v1 = 1:6, v2 = 1:8, v3 = 3:8)
items <- c(2:7)

# Search for items in each set
result <- data.frame(searched = items)
for (set in names(sets)) {
  result <- cbind(result, items %in% sets[[set]])
  names(result)[length(names(result))] <- set
}

# Count
library(plyr)
ddply(result, names(sets), function (i) {
  data.frame(count = nrow(i))
})

This gives you every combination of set memberships that actually occurs among the items, with its count:

     v1   v2    v3 count
1 FALSE TRUE  TRUE     1
2  TRUE TRUE FALSE     1
3  TRUE TRUE  TRUE     4
Jakub P.
  • What are the counts counting? You have three TRUE in the last line, but the count is four..? I need to know the number of elements in each intersection – SiKiHe Jul 24 '15 at 09:40
  • Maybe I have trouble understanding your output. For the data in your example, I'd like to know that the number of elements in V_1 \cap V_2 = {1, 2} is 2, the number of elements in V_2 \cap V_3 = {7, 8} is 2, and the number of elements in V_1 \cap V_2 \cap V_3 = {3, 4, 5, 6} is 4, and that all other intersections are empty. – SiKiHe Jul 24 '15 at 09:53
  • OK, I may have solved a more general problem. Put the union of all sets in the variable `items` and you'll get what you need. The code above checks which sets an arbitrary other set intersects with, so a row in the result data frame shows how many items in `items` belong to exactly those v_i that are TRUE. Row 1 tells you there is 1 item in the (v2, v3) set, row 2 that there is 1 item in the (v1, v2) set, and row 3 that there are 4 items in the (v1, v2, v3) set. By set (v1, v2) I mean the intersection of just v1 and v2. If you put `items <- union(v1...vN)` you'll get what you are after. – Jakub P. Jul 24 '15 at 10:42
  • I guess my question title is a bit of a misnomer, too. I wasn't looking for a way of counting the intersections, but the number of elements in the intersections :) – SiKiHe Jul 24 '15 at 11:42
  • Well, my response does give you both the size of each intersection (column `count`) and the number of non-empty intersections (=count of rows in the data.frame). – Jakub P. Jul 27 '15 at 09:53
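
The suggestion in the comments (search for the union of all sets) can be sketched in base R without plyr. This is an illustration rather than the answerer's exact code: the names `membership`, `region` and `region_counts` are made up here, and the sets are the ones from the example above.

```r
## Sets from the example above
sets <- list(v1 = 1:6, v2 = 1:8, v3 = 3:8)

## Search for every element that occurs in any set
items <- Reduce(union, sets)

## One logical column per set: is the item a member?
membership <- as.data.frame(lapply(sets, function(s) items %in% s))

## Label each item with the exact combination of sets that contain it
region <- apply(membership, 1, function(r) paste(names(sets)[r], collapse = "."))

## Number of elements in each non-empty region of the Venn diagram
region_counts <- table(region)
region_counts
## v1.v2 has 2 elements, v1.v2.v3 has 4, v2.v3 has 2
```

Regions that contain no elements simply do not appear in the table.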

Here's an attempt:

list1 <- c("a","b","c","e")
list2 <- c("a","b","c","e")
list3 <- c("a","b")
list4 <- c("a","b","g","h")
list_names <- c("list1","list2","list3","list4")

lapply(seq_along(list_names), function(y) {
    # all combinations of y list names, kept as a list so nothing is simplified
    combinations <- combn(list_names, y, simplify = FALSE)
    res <- lapply(combinations, function(x) {
        # elements shared by every list in the current combination ...
        p <- Reduce(intersect, lapply(x, get))
        # ... minus anything that also appears in one of the other lists
        others <- setdiff(list_names, x)
        if (length(others) > 0) {
            p <- setdiff(p, unlist(lapply(others, get)))
        }
        # NA marks an empty region
        if (length(p) > 0) p else NA
    })
    names(res) <- sapply(combinations, paste, collapse = "-")
    res
})

The first lapply is used to loop from 1 to the number of sets you have. Then I took all possible combinations of list names, taken y at a time. This essentially generates all of the different subareas in the Venn diagram.

For each combination, the output is the set difference between the intersection of the lists in the current combination and the union of the other lists that are not in the combination.
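
As a worked example of that single step, here is the region for the combination list1-list2, using the four lists defined above (the variable names `inside`, `outside` and `only_1_2` are just for illustration):

```r
list1 <- c("a","b","c","e")
list2 <- c("a","b","c","e")
list3 <- c("a","b")
list4 <- c("a","b","g","h")

inside  <- Reduce(intersect, list(list1, list2))  # "a" "b" "c" "e"
outside <- Reduce(union, list(list3, list4))      # "a" "b" "g" "h"
only_1_2 <- setdiff(inside, outside)              # "c" "e"
```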

The final result is a list with one element per set supplied. The first element holds the items unique to each single list, the second holds the items unique to each combination of two lists, and so on.

NicE