Re-organizing the data set based on all possible combinations

Question

Suppose I have data on three individuals, A, B, and C, each with two characteristics: "years of school" (YS) and "number of siblings" (NS). The dataset X looks as follows:

id <- c("A", "B", "C")
YS <- c(6, 9, 8)
NS <- c(1, 0, 3)
X <- data.frame(id, YS, NS)

Now I have to re-organize the data set based on all possible combinations of A, B, and C. More precisely, the combinations are A, B, C, AB, AC, BC, ABC, and a null combination, i.e., 2^3 = 8 combinations in total (2^3 - 1 = 7 non-empty ones plus the null set). Besides combining the individuals, I also have to calculate the value of each characteristic for each combination. For instance, the values of YS and NS for the combination AB are 15 and 1, and for the combination ABC they are 23 and 4.

I understand that expand.grid can generate the possible combinations, but I don't know how to combine the values of the characteristics at the same time. Can anyone help? Thanks.

Ken Benoit · Answer 1 · 2015-11-17T22:17:55.940

Not very pretty or R-like, but it works. And it includes the NULL set as per the question.

# function to create the combinations and sum the elements
reorgCombs <- function(data) {
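    ids <- rownames(data)  # row names identify the individuals
    # start with the empty set and the single-element combinations
    newdata <- data.frame(comb = c("NULL", ids), YS = c(0, data[, "YS"]),
                          NS = c(0, data[, "NS"]), row.names = NULL)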
    for (i in 2:nrow(data)) {
        theseCombs <- t(combn(ids, i))
        newdata <- rbind(newdata, 
                         data.frame(comb = apply(theseCombs, 1, paste0, collapse=""),
                                    YS = apply(theseCombs, 1, function(x) sum(data[x, "YS"])),
                                    NS = apply(theseCombs, 1, function(x) sum(data[x, "NS"]))))
    }
    newdata
}

# make this a numeric matrix with named dimensions
# the names will be used for lookup
X2 <- cbind(YS, NS)
rownames(X2) <- id

reorgCombs(X2)
##   comb YS NS
## 1 NULL  0  0
## 2    A  6  1
## 3    B  9  0
## 4    C  8  3
## 5   AB 15  1
## 6   AC 14  4
## 7   BC 17  3
## 8  ABC 23  4

Edited with new benchmarks:

Perhaps because of the lookup table, it's relatively fast despite the looping -- but whooped by Matthew's solution:

## Unit: relative
##    expr      min       lq     mean   median       uq       max neval
##    jota  4.479829  4.408874  4.304705  4.455843  4.335172  3.730202   100
##  pierre 11.606636 11.623717 12.743089 12.078027 11.761123 19.271072   100
##     ken  3.034247  3.015091  2.978181  3.040916  2.914744  2.755357   100
## matthew  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   100
##   frank  4.572867  4.615341  4.590244  4.719418  4.516317  3.978101   100

Thanks for doing the benchmark. I'm surprised Matthew's wins, since it calls `combn` separately for each column. — Frank, Nov 17 '15 at 22:19
I also wonder if any of you can tell me what `x=x` stands for in the `sapply` function. Thanks! — tzu, Nov 18 '15 at 04:20

Matthew Plourde · Accepted Answer · 2015-11-17T21:50:59.540

Here is another option using combn:

all_combn <- function(x, ...)
    unlist(sapply(seq_along(x), combn, x=x, ...))

data.frame(
   id=all_combn(id, paste, collapse=''),
   YS=all_combn(YS, sum),
   NS=all_combn(NS, sum)
)

#    id YS NS
# 1   A  6  1
# 2   B  9  0
# 3   C  8  3
# 4  AB 15  1
# 5  AC 14  4
# 6  BC 17  3
# 7 ABC 23  4

edited Nov 17 '15 at 21:50

answered Nov 17 '15 at 21:18

Thanks for your answer Matthew (and of course everyone who helped). I have another question about this code. What does `x=x` in the second line mean? Thank you. – tzu Nov 17 '15 at 23:15
Sorry @MatthewPlourde. Would you mind explaining the `sapply` part? I am a bit lost there, because to me the first element of `sapply` seems to be the names of a list, but here you use `seq_along(x)` and I don't quite get the meaning. Also, how does `x=x` work? It seems to be used for telling `combn` what `m` is. Anyway, I would appreciate it if you could explain this `sapply` part. – tzu Nov 18 '15 at 06:04
`sapply` iterates over the id lengths. First it makes all the combinations for 1 id, then 2 ids, then 3 ids, etc. You're right the values of `seq_along(x)` get passed to the `m` argument of `combn`, because the first argument, `x`, is specified in the `sapply` call. – Matthew Plourde Nov 18 '15 at 12:12
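
To make the comment above concrete, here is a minimal sketch of what the `sapply` call expands to for the `YS` vector from the question. The names `by_size` and `unrolled` are illustrative only and do not appear in the answer.

```r
YS <- c(6, 9, 8)

# sapply() calls combn() once per value of seq_along(YS), i.e. 1, 2, 3.
# `x = YS` pins the data argument by name, so each value from
# seq_along(YS) falls into the next free argument of combn(), which is m,
# and `sum` lands in FUN.
by_size <- sapply(seq_along(YS), combn, x = YS, sum)

# The same three calls written out by hand (illustrative only)
unrolled <- list(
  combn(YS, 1, sum),  # sums over single elements
  combn(YS, 2, sum),  # sums over pairs
  combn(YS, 3, sum)   # sum over all three
)

# Both flatten to the YS column of the result: 6 9 8 15 14 17 23
unlist(by_size)
unlist(unrolled)
```

Because the three calls return results of different lengths, `sapply` cannot simplify them to a matrix and hands back a list here, which is why `all_combn` wraps the call in `unlist`.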

Frank · Answer 3 · 2015-11-17T23:09:07.480

Here's one way to do it in base R. First, identify combos:

n = nrow(X)
combos = do.call(rbind, lapply(seq(n), function(x){
  r = combn(n, x)
  data.frame( r = c(r), g = paste(x, c(col(r)), sep=".") )
}))

Then, select rows of X for each combo:

Xc    = X[combos$r,]
Xc$id = as.character(Xc$id)
Xc$g  = ave(Xc$id, combos$g, FUN = function(x) paste0(x,collapse=''))

Finally, aggregate for each combo:

aggregate(cbind(YS,NS)~g, Xc, sum)

#     g YS NS
# 1   A  6  1
# 2  AB 15  1
# 3 ABC 23  4
# 4  AC 14  4
# 5   B  9  0
# 6  BC 17  3
# 7   C  8  3

You're missing the empty set this way, but that's easy enough to rbind on if desired.
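
A minimal sketch of that last step, assuming the empty set should be labelled "NULL" with sums of 0 (a labelling borrowed from another answer, not something this answer specifies). The earlier steps are repeated so the block runs on its own; `res` and `with_empty` are illustrative names.

```r
X <- data.frame(id = c("A", "B", "C"), YS = c(6, 9, 8), NS = c(1, 0, 3))

# Steps from the answer, repeated so this sketch is self-contained
n <- nrow(X)
combos <- do.call(rbind, lapply(seq(n), function(x) {
  r <- combn(n, x)
  data.frame(r = c(r), g = paste(x, c(col(r)), sep = "."))
}))
Xc <- X[combos$r, ]
Xc$g <- ave(as.character(Xc$id), combos$g,
            FUN = function(x) paste0(x, collapse = ""))
res <- aggregate(cbind(YS, NS) ~ g, Xc, sum)

# Assumption: the empty set gets the label "NULL" and zero sums.
# Convert g to character first so rbind() does not trip over factor levels.
res$g <- as.character(res$g)
with_empty <- rbind(data.frame(g = "NULL", YS = 0, NS = 0), res)
with_empty
```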

I appreciate your help @Frank! – tzu Nov 18 '15 at 04:18

score 0 · Answer 4 · answered Nov 17 '15 at 19:10

It looks like a lot, but I happened to be using splitstackshape for another answer and saw a possible application here. The first call, `lst1 <- do.call(c, "all combinations")`, creates the list of all possibilities you mentioned. You can add edge cases like NULL later if you like. We then create a data frame `df` from `lst1` to organize the information, and `cSplit` reshapes `df` to long format. We merge with `X` to add in the numeric values. Finally, with dplyr we group by the index column we created, convert any factors to integer, and compute the sums:

library(dplyr)
library(splitstackshape)

lst1 <- do.call(c, lapply(seq_along(id), function(i) combn(id, i, simplify=FALSE)))
df <- data.frame(indx=seq_along(lst1), combs=sapply(lst1, toString))
df.long <- cSplit(df, 'combs', direction="long")

m <- merge(X, df.long, by.x='id', by.y='combs')
m %>% group_by(indx) %>%
  mutate_each(funs(as.integer(as.character(.))), -id) %>%
  summarise(id=toString(id), YS=sum(YS), NS=sum(NS))
# Source: local data frame [7 x 4]
# 
#    indx      id    YS    NS
#   (int)   (chr) (int) (int)
# 1     1       A     6     1
# 2     2       B     9     0
# 3     3       C     8     3
# 4     4    A, B    15     1
# 5     5    A, C    14     4
# 6     6    B, C    17     3
# 7     7 A, B, C    23     4
