Split a dataframe into all possible combinations of dataframes by 3 columns in R

Question

I need to split an original dataframe into all possible dataframes built from combinations of 3 of its columns. Every resulting dataframe must also contain the id column. I'm at a dead end: I don't know how to store all of these dataframes so that I can keep working with them. One idea is to save them in a list, but I still don't know how to bind the necessary columns together. I found a question close to mine, but it is still quite different. Besides, the original dataframe has more than 1 million rows and about 20 columns, so it seems reasonable to use data.table.

frame <- data.frame(id = letters[seq( from = 1, to = 10 )], 
                    a = rnorm(10, 4), b = rnorm(10, 6), c=rnorm(10, 5),
                    d = rnorm(10, 2))

combos <- data.table(combn(colnames(frame[,-1]), 3))
combos <- data.table(t(rbind(combos, t(rep(colnames(output2[,1]), ncol(combos))))))
names(combos) <- c('category_1', 'category_2', 'category_3', 'id')

list_tables <- apply(combos, 1, as.list)

Guys, I will appreciate any help. Thanks in advance

Do you mean that you want all possible combinations of 3 columns, along with the ID column? So, in your given example you would want `id,a,b,c`; `id,a,b,d`; `id,b,c,d`? Or does order matter? Or something else? — Gregor Thomas, May 30 '18 at 22:16
Btw, your sample code doesn't run because we don't have `output2`... — Gregor Thomas, May 30 '18 at 22:16
Also, do you really need all of them at once? Or is it enough to have a nice little look-up table so you can easily get any one of them at a time? Because if you have 20 columns, `choose(20, 3)` is 1140 possibilities, and that's a whole lot of copies of your data to have in memory. With 1M rows, you'd need a pretty hefty computer... — Gregor Thomas, May 30 '18 at 22:20
It very much sounds like the XY problem though. What is your broader goal? — moodymudskipper, May 30 '18 at 23:07
Oh, you are right. It should be `frame`, not `output2`. And order doesn't matter, but there should be no repetitions such as `id,a,b,c`; `id,b,a,c`. — iomedee, May 31 '18 at 09:07

score 2 · Answer 1 · answered May 30 '18 at 22:18

Please see the comments to your OP re sample data and expected output. That aside, perhaps you can do something like this?

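# Each column of combn() holds the positions of 3 non-id columns;
# add 1 to skip past the id column, and always keep column 1 (id)
lapply(as.data.frame(combn(ncol(frame) - 1, 3)), function(idx)
    frame[, c(1, idx + 1)])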
#$V1
#   id        a        b        c
#1   a 5.434201 6.342768 5.140709
#2   b 3.922708 7.572425 4.147767
#3   c 4.739137 5.253265 6.903397
#4   d 2.241395 6.306650 3.351814
#5   e 3.930175 4.569514 5.759625
#6   f 4.451906 7.194427 5.062291
#7   g 2.041634 5.517932 4.610969
#8   h 3.998476 7.317862 5.636666
#9   i 3.734664 4.870168 4.132215
#10  j 5.563223 5.073649 5.098734
#
#$V2
#   id        a        b         d
#1   a 5.434201 6.342768 1.3168256
#2   b 3.922708 7.572425 2.2410894
#3   c 4.739137 5.253265 2.5894319
#4   d 2.241395 6.306650 1.0693751
#5   e 3.930175 4.569514 2.2974619
#6   f 4.451906 7.194427 5.1372771
#7   g 2.041634 5.517932 0.9724653
#8   h 3.998476 7.317862 3.9418028
#9   i 3.734664 4.870168 1.7220438
#10  j 5.563223 5.073649 1.7784112
#
#$V3
#   id        a        c         d
#1   a 5.434201 5.140709 1.3168256
#2   b 3.922708 4.147767 2.2410894
#3   c 4.739137 6.903397 2.5894319
#4   d 2.241395 3.351814 1.0693751
#5   e 3.930175 5.759625 2.2974619
#6   f 4.451906 5.062291 5.1372771
#7   g 2.041634 4.610969 0.9724653
#8   h 3.998476 5.636666 3.9418028
#9   i 3.734664 4.132215 1.7220438
#10  j 5.563223 5.098734 1.7784112
#
#$V4
#   id        b        c         d
#1   a 6.342768 5.140709 1.3168256
#2   b 7.572425 4.147767 2.2410894
#3   c 5.253265 6.903397 2.5894319
#4   d 6.306650 3.351814 1.0693751
#5   e 4.569514 5.759625 2.2974619
#6   f 7.194427 5.062291 5.1372771
#7   g 5.517932 4.610969 0.9724653
#8   h 7.317862 5.636666 3.9418028
#9   i 4.870168 4.132215 1.7220438
#10  j 5.073649 5.098734 1.7784112

Sample data

set.seed(2017)
frame <- data.frame(id = letters[seq(from = 1, to = 10)],
                    a = rnorm(10, 4), b = rnorm(10, 6), c = rnorm(10, 5),
                    d = rnorm(10, 2))

It is best to always set a fixed seed when providing random sample data, so that others can reproduce the same values.
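If it helps to look up a result by its columns rather than by `V1`, `V2`, ..., a variant of the same idea can build the combinations from column names and label each list element. This is only a sketch: the names `value_cols`, `combo_list` and `frames_by_combo`, and the `"a_b_c"` labelling scheme, are made up for illustration.

```r
# Sample data as defined above (fixed seed for reproducibility)
set.seed(2017)
frame <- data.frame(id = letters[1:10],
                    a = rnorm(10, 4), b = rnorm(10, 6), c = rnorm(10, 5),
                    d = rnorm(10, 2))

# All unordered 3-column combinations of the non-id columns,
# returned as a list of character vectors
value_cols <- setdiff(names(frame), "id")
combo_list <- combn(value_cols, 3, simplify = FALSE)

# Label each combination by its column names, e.g. "a_b_c"
# (hypothetical naming scheme)
names(combo_list) <- vapply(combo_list, paste, character(1), collapse = "_")

# One data frame per combination, each keeping the id column
frames_by_combo <- lapply(combo_list, function(cols) frame[, c("id", cols)])

names(frames_by_combo)
head(frames_by_combo[["a_b_d"]], 3)
```

Because `combn()` only returns unordered combinations, `id,a,b,c` and `id,b,a,c` cannot both appear.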

score 1 · Answer 2 · answered May 30 '18 at 22:25

I'd recommend not generating all the data into a list. Just generate a matrix of column name combinations (like what you're doing) and use them one-at-a-time:

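# 3-row matrix: each column is one combination of non-id column names
combos = combn(colnames(frame[,-1]), 3)
# prepend "id" so that every column of combos also selects the id column
combos = rbind("id", combos)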

Then you just use the i-th column of `combos` to subset `frame` on demand.

# first combo
frame[combos[, 1]]
# hundred and third combo (exists only with the full ~20-column data; the 5-column sample has 4 combos)
frame[combos[, 103]]
# etc.
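
To make the on-demand idea concrete, here is a minimal sketch that walks over the columns of `combos`, subsets `frame` for one combination at a time, and keeps only a small result per combination. The row-sum mean is just a stand-in for whatever analysis is actually needed, and `combo_means` is a made-up name.

```r
# Sample data from the question, with a fixed seed
set.seed(2017)
frame <- data.frame(id = letters[1:10],
                    a = rnorm(10, 4), b = rnorm(10, 6), c = rnorm(10, 5),
                    d = rnorm(10, 2))

# Matrix of column names: first row is "id", remaining rows are one combination
combos <- rbind("id", combn(colnames(frame)[-1], 3))

# Process one combination at a time; only the small result is kept,
# the temporary subset is discarded after each step
combo_means <- apply(combos, 2, function(cols) {
  sub <- frame[cols]        # id plus the three value columns
  mean(rowSums(sub[-1]))    # stand-in analysis (assumption, not from the question)
})

# Label the results by the value columns of each combination
names(combo_means) <- apply(combos[-1, , drop = FALSE], 2, paste, collapse = "_")
combo_means
```

Only one subset exists at any moment, so memory use stays close to that of the original data rather than growing with the number of combinations.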

Making `frame` a data.table is a good idea, but keeping `combos` as a plain matrix will be simpler and more efficient.
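
One caveat if `frame` does become a data.table: `frame[combos[, 1]]` is data.frame-style column selection, whereas data.table treats a character vector in the first argument as a row lookup rather than a column selection, so the columns have to be chosen in `j`. A rough sketch, assuming the data.table package is installed (`DT` and `cols` are illustrative names):

```r
library(data.table)

# Same sample data, built directly as a data.table
set.seed(2017)
DT <- data.table(id = letters[1:10],
                 a = rnorm(10, 4), b = rnorm(10, 6), c = rnorm(10, 5),
                 d = rnorm(10, 2))

# Column-name matrix as before: "id" on top of each 3-column combination
combos <- rbind("id", combn(setdiff(names(DT), "id"), 3))

# Pick one combination and select its columns in j
cols <- combos[, 2]
DT[, ..cols]                # ".." prefix: look up cols in the calling scope
DT[, cols, with = FALSE]    # equivalent spelling
DT[, .SD, .SDcols = cols]   # handy when further computing on the subset
```

All three forms should return the same four columns; which one reads best is largely a matter of taste.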
