8

This is a basic problem in data analysis which Stata deals with in one step.

Create a wide data frame with time invariant data (x0) and time varying data for years 2000 and 2005 (x1,x2):

d1 <- data.frame(subject = c("id1", "id2"),  
x0 = c("male", "female"),  
x1_2000 = 1:2,   
x1_2005 = 5:6,  
x2_2000 = 1:2,  
x2_2005 = 5:6    
) 

s.t.

subject x0 x1_2000 x1_2005 x2_2000 x2_2005  
1     id1 male         1       5       1       5  
2     id2 female       2       6       2       6  

I want to shape it like a panel so data looks like this:

        subject     x0 time x1 x2
1     id1   male 2000  1  1
2     id2 female 2000  2  2
3     id1   male 2005  5  5
4     id2 female 2005  6  6

I can do this with reshape s.t.

d2 <-reshape(d1, 
idvar="subject",
varying=list(c("x1_2000","x1_2005"),
    c("x2_2000","x2_2005")),
    v.names=c("x1","x2"),
    times = c(2000,2005),
    direction = "long",
    sep= "_")

My main concern is that when you have dozens of variables the above command gets very long. In stata one would simply type:

reshape long x1 x2, i(subject) j(year)

Is there such a simple solution in R?

Metrics
  • 15,172
  • 7
  • 54
  • 83
Fred
  • 1,833
  • 3
  • 24
  • 29

2 Answers2

12

reshape can guess many of its arguments. In this case it's sufficient to specify the following. No packages are used.

 reshape(d1, dir = "long", varying = 3:6, sep = "_")

giving:

       subject     x0 time x1 x2 id
1.2000     id1   male 2000  1  1  1
2.2000     id2 female 2000  2  2  2
1.2005     id1   male 2005  5  5  1
2.2005     id2 female 2005  6  6  2
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Nice! But what happens when variables are named _sample_1_2000_ etc... Can one be more spjhisticated with the `sep=` option? – Fred Mar 14 '11 at 14:05
  • 2
    @Fred, Use the `split` argument in place of `sep`, i.e. `reshape(d1, dir = "long", varying = 3:6, split = list(regexp = "_2", include = TRUE))` or reduce this case to the one in the question, i.e. `reshape(setNames(d1, sub("sample_", "", names(d1))), dir = "long", varying = 3:6, sep = "_")` – G. Grothendieck Mar 14 '11 at 16:35
4

here is a brief example using reshape2 package:

library(reshape2)
library(stringr)

# it is always useful to start with melt
d2 <- melt(d1, id=c("subject", "x0"))

# redefine the time and x1, x2, ... separately
d2 <- transform(d2, time = str_replace(variable, "^.*_", ""),
                    variable = str_replace(variable, "_.*$", ""))

# finally, cast as you want
d3 <- dcast(d2, subject+x0+time~variable)

now you don't need even specifying x1 and x2.
This code works if variables increase:

> d1 <- data.frame(subject = c("id1", "id2"), x0 = c("male", "female"),
+ x1_2000 = 1:2,
+ x1_2005 = 5:6,
+ x2_2000 = 1:2,
+ x2_2005 = 5:6,
+ x3_2000 = 1:2,
+ x3_2005 = 5:6,
+ x4_2000 = 1:2,
+ x4_2005 = 5:6
+ ) 
> 
> d2 <- melt(d1, id=c("subject", "x0"))
> d2 <- transform(d2, time = str_replace(variable, "^.*_", ""),
+                     variable = str_replace(variable, "_.*$", ""))
> 
> d3 <- dcast(d2, subject+x0+time~variable)
> 
> d3
  subject     x0 time x1 x2 x3 x4
1     id1   male 2000  1  1  1  1
2     id1   male 2005  5  5  5  5
3     id2 female 2000  2  2  2  2
4     id2 female 2005  6  6  6  6
kohske
  • 65,572
  • 8
  • 165
  • 155
  • Thanks, this is useful. Not very clear what `transform` is doing (help file not very helpful) nor how to interpret `"^.*_"` and `"_.*$"`. I ask because some vars are actually named "sample_1_2000" etc... – Fred Mar 13 '11 at 03:34
  • Maybe this is more intuitive with year data named "sample_1_2000" and so on: `temp1 <- transform(temp, time = str_sub(variable, -4), variable = str_sub(variable, 1,str_length(variable)-5))` – Fred Mar 13 '11 at 03:52
  • @Fred it depends on the format of variable name. if the length of (at least some part of) character is fixed, your way is easier. otherwise, regular expression is more flexible. – kohske Mar 13 '11 at 05:05