0

I have a dataset with one (or more) id variables and many subsets of variables which have the same structure. I would like to stack these subsets, in a long format : I have three subsets in my example, so the final table must have three times more rows, and the id variable has to be triplicated.

The foo_data table is an example of my data in R :

individuals <-c("individual1","individual2","individual3")
subset1.var1 <- c("value1","value2","value3")
subset1.var2 <- c("value4","value5","value6")
subset1.var3 <- c("value7","value8","value9")

subset2.var1 <- c("value10","value11","value12")
subset2.var2 <- c("value13","value14","value15")
subset2.var3 <- c("value16","value17","value18")

subset3.var1 <- c("value19","value20","value21")
subset3.var2 <- c("value22","value23","value24")
subset3.var3 <- c("value25","value26","value27")

foo_data <-data.frame(individuals,subset1.var1,subset1.var2,subset1.var3,subset2.var1,
subset2.var2,subset2.var3,subset3.var1,subset3.var2,subset3.var3)

foo_data

And this is what I would like to have :

structure(list(id = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L), .Label = c("individual1", "individual2", "individual3"), class = "factor"), 
    var1 = structure(1:9, .Label = c("value1", "value2", "value3", 
    "value10", "value11", "value12", "value19", "value20", "value21"
    ), class = "factor"), var2 = structure(1:9, .Label = c("value4", 
    "value5", "value6", "value13", "value14", "value15", "value22", 
    "value23", "value24"), class = "factor"), var3 = structure(1:9, .Label = c("value7", 
    "value8", "value9", "value16", "value17", "value18", "value25", 
    "value26", "value27"), class = "factor")), .Names = c("id", 
"var1", "var2", "var3"), row.names = c(NA, 9L), class = "data.frame")

#                     individuals    time    var1    var2    var3
# individual1.subset1 individual1 subset1  value1  value4  value7
# individual2.subset1 individual2 subset1  value2  value5  value8
# individual3.subset1 individual3 subset1  value3  value6  value9
# individual1.subset2 individual1 subset2 value10 value13 value16
# individual2.subset2 individual2 subset2 value11 value14 value17
# individual3.subset2 individual3 subset2 value12 value15 value18
# individual1.subset3 individual1 subset3 value19 value22 value25
# individual2.subset3 individual2 subset3 value20 value23 value26
# individual3.subset3 individual3 subset3 value21 value24 value27

The most obvious solution would be to compute a loop, like this one :

stacked_foo <- foo_data[,c(1:4)]
noms <- c("id","var1","var2","var3")
names(stacked_foo) <- noms
for (i in 2:3){
  c(1,(3*i-1):(3*(i+1)-2))
  temp <- foo_data[,c(1,(3*i-1):(3*(i+1)-2))]
  names(temp) <- noms
  stacked_foo <- rbind(stacked_foo,temp)
  print(i)
}

But isn't there something more fast, more concise, with a built in function like melt or stack ?

Infinite Recursion
  • 6,511
  • 28
  • 39
  • 51
Giuliano
  • 323
  • 3
  • 10

1 Answers1

4

Option 1: Base R's reshape function

This is a task for reshape (from base R), but it wants your data.frame names to be in a slightly different format.

Try this:

names(foo_data) <- gsub("(.*)\\.(.*)", "\\2.\\1", names(foo_data))
names(foo_data)
#  [1] "individuals"  "var1.subset1" "var2.subset1" "var3.subset1" "var1.subset2"
#  [6] "var2.subset2" "var3.subset2" "var1.subset3" "var2.subset3" "var3.subset3"

If your names are in this format, you can just do:

reshape(foo_data, direction = "long", idvar = "individuals", 
        varying = 2:ncol(foo_data))
#                     individuals    time    var1    var2    var3
# individual1.subset1 individual1 subset1  value1  value4  value7
# individual2.subset1 individual2 subset1  value2  value5  value8
# individual3.subset1 individual3 subset1  value3  value6  value9
# individual1.subset2 individual1 subset2 value10 value13 value16
# individual2.subset2 individual2 subset2 value11 value14 value17
# individual3.subset2 individual3 subset2 value12 value15 value18
# individual1.subset3 individual1 subset3 value19 value22 value25
# individual2.subset3 individual2 subset3 value20 value23 value26
# individual3.subset3 individual3 subset3 value21 value24 value27

You can remove the funky rownames with rownames(your-new-df-name) <- NULL.


Option 2: melt and dcast from "reshape2"

If you wanted to go the reshape2 route, you can try (this is assuming the column names have not been changed):

library(reshape2)
dfL <- melt(foo_data, id.vars = "individuals") ## Warning ahead, but OK
# Warning message:
# attributes are not identical across measure variables; they will be dropped 
X <- colsplit(dfL$variable, c("SUB", "VAR"), pattern = "\\.")
dcast(dfL, individuals + X$SUB ~ X$VAR, value.var = "value")
#   individuals   X$SUB    var1    var2    var3
# 1 individual1 subset1  value1  value4  value7
# 2 individual1 subset2 value10 value13 value16
# 3 individual1 subset3 value19 value22 value25
# 4 individual2 subset1  value2  value5  value8
# 5 individual2 subset2 value11 value14 value17
# 6 individual2 subset3 value20 value23 value26
# 7 individual3 subset1  value3  value6  value9
# 8 individual3 subset2 value12 value15 value18
# 9 individual3 subset3 value21 value24 value27

Option 3: "tidyr" + "dplyr"

However, it seems like people are starting to look at tidyr + dplyr as the next big thing for reshaping data. I'm not sold yet on all the piping stuff, but the approach would be:

library(dplyr)
# devtools::install_github("hadley/tidyr")
library(tidyr)
foo_data %>%
  gather(vars, vals, subset1.var1:subset3.var3) %>%
  separate(vars, into = c("subset", "var")) %>%
  spread(var, vals)
#   individuals  subset    var1    var2    var3
# 1 individual1 subset1  value1  value4  value7
# 2 individual1 subset2 value10 value13 value16
# 3 individual1 subset3 value19 value22 value25
# 4 individual2 subset1  value2  value5  value8
# 5 individual2 subset2 value11 value14 value17
# 6 individual2 subset3 value20 value23 value26
# 7 individual3 subset1  value3  value6  value9
# 8 individual3 subset2 value12 value15 value18
# 9 individual3 subset3 value21 value24 value27
# Warning message:
# attributes are not identical across measure variables; they will be dropped 
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Thanks for your answers. They work well. The sole problem is that your solutions manipulate the names of the columns, whereas I was mere looking for something like : I give the number of column of the subset (each cubset has the same numer of columns) and I give the number of id columns as input of a function that computes the whole operation. But renaming the columns of my dataset is not a big deal ! Thank you ! – Giuliano Jul 17 '14 at 13:18