1

I'm trying to reshape my data from a long format into a wide format based on multiple groupings, without success. with this data:

id <- 1:20
month <- rep(4:7, 50)
name <- rep(c("sam", "mike", "tim", "jill", "max"), 40)
cost <- sample(1:100, 200, replace=TRUE)
df <- data.frame(id, month, name, cost)

df.mo.mean <- aggregate(df$cost ~ df$name + df$month, FUN="mean")
df.mo.sd <- aggregate(df$cost ~ df$name + df$month, FUN="sd")

df.mo <- data.frame(df.mo.mean, df.mo.sd)
df.mo <- df.mo[,-c(4,5)]
df.mo[3:4] <- round(df.mo[3:4],2)

head(df)
   id month name cost
1  1     4  sam   29
2  2     5 mike   93
3  3     6  tim   27
4  4     7 jill   67
5  5     4  max   28
6  6     5  sam   69

I'm trying to get my data to look like something below, and try to generalize it for an unknown number of names (but <15 max)

month    name1.cost.mean  name1.cost.sd  name2.cost.mean  name2.cost.sd
1        45               4              40               6
2        ...   

I've tried reshape and do.call with rbind without success. The only other way I can think of doing it is with a loop, which means I'm doing something wrong. I dont have any experience with plyr and would prefer to solve this problem with base packages (for learning purposes), but if its not possible any other suggestions would be very helpful

Arun
  • 116,683
  • 26
  • 284
  • 387
ano
  • 704
  • 5
  • 15

4 Answers4

4
set.seed(1)
 library(plyr)
 kk<-ddply(df,.(month,name),summarize,mean=mean(cost),sd=sd(cost))
 reshape(kk,timevar="name",idvar="month",direction="wide")



    month mean.jill  sd.jill mean.max   sd.max mean.mike  sd.mike mean.sam   sd.sam mean.tim   sd.tim
1      4      55.3 34.62834     63.3 23.35261      57.6 22.91627     63.4 28.89906     43.3 25.42112
6      5      49.3 25.00689     51.1 27.85059      48.4 23.16223     43.0 24.33562     47.6 32.13928
11     6      60.4 23.61826     52.1 29.74503      38.6 34.39703     53.0 23.28567     52.4 20.88700
16     7      50.0 30.76073     62.7 23.98634      51.7 32.10763     52.8 32.27589     49.5 23.00845
Metrics
  • 15,172
  • 7
  • 54
  • 83
  • I was just messing with reshape for this. It works, but just curious as to why name would be a timevar instead of month? – tcash21 Oct 19 '13 at 23:09
  • 2
    With timevar, we denote the variable that will define the multiple observations per month. – Metrics Oct 19 '13 at 23:14
  • I think there is an error in the reshape, `mean.jill` for month `4` is `39.5`. You can see it also from your `kk` – Michele Oct 19 '13 at 23:16
  • then you are using a different `df`. Of course, there is a `sample` function in there... – Michele Oct 19 '13 at 23:21
  • I think its a result of the way I generated the cost variable, being a random sample. sorry, i didnt specify a set.seed to ensure everyone had the same data – ano Oct 19 '13 at 23:23
  • @ano I know, also my bad cause I just copied and pasted without paying so much attantion... you have plenty of options anyway :) – Michele Oct 19 '13 at 23:24
  • when applying to my actual data, came across this error: `Error in attributes(out) <- attributes(col) : 'names' attribute [9] must be the same length as the vector [4]` I think it is because there is some missing data (certain people have no observations for some months)...any suggestions? – ano Oct 19 '13 at 23:51
  • @ ano: It should be fine. It will give you NA for the name with observations on month. – Metrics Oct 20 '13 at 00:10
  • 1
    aha! found my problem: [solution](http://stackoverflow.com/questions/14153092/meaning-of-ddply-error-names-attribute-9-must-be-the-same-length-as-the-vec) – ano Oct 20 '13 at 00:16
1

I'm not sure what you are asking for, but maybe something like this could be useful

> set.seed(1)
> df <- data.frame(id=1:20, month=rep(4:7, 50), 
+                  name=rep(c("sam", "mike", "tim", "jill", "max"), 40),
+                  cost= sample(1:100, 200, replace=TRUE))
> 
> DF.mean <- aggregate(cost ~ name + month, FUN=mean, data=df)  ## mean
> DF.sd   <- aggregate(cost ~ name + month, FUN=sd, data=df)    ## sd
> 
> x1 <- as.data.frame.matrix(xtabs(cost~month+name, data=DF.mean)) # reshaping mean
> colnames(x1) <- paste0(colnames(x1), ".mean")
> x2 <- as.data.frame.matrix(xtabs(cost~month+name, data=DF.sd))   # reshaping sd
> colnames(x2) <- paste0(colnames(x2), ".sd")
> 
> cbind(x1, x2)
  jill.mean max.mean mike.mean sam.mean tim.mean  jill.sd   max.sd  mike.sd   sam.sd   tim.sd
4      55.3     63.3      57.6     63.4     43.3 34.62834 23.35261 22.91627 28.89906 25.42112
5      49.3     51.1      48.4     43.0     47.6 25.00689 27.85059 23.16223 24.33562 32.13928
6      60.4     52.1      38.6     53.0     52.4 23.61826 29.74503 34.39703 23.28567 20.88700
7      50.0     62.7      51.7     52.8     49.5 30.76073 23.98634 32.10763 32.27589 23.00845

Also, note that @Metrics approach can be done using R base functions without any extra packages:

> kk <- aggregate(cost ~ name + month, FUN=function(x) c(mean=mean(x), sd=sd(x)), data=df)
> reshape(kk,timevar="name",idvar="month",direction="wide")
  month cost.jill.mean cost.jill.sd cost.max.mean cost.max.sd cost.mike.mean cost.mike.sd cost.sam.mean cost.sam.sd cost.tim.mean cost.tim.sd
1      4       55.30000     34.62834      63.30000    23.35261       57.60000     22.91627      63.40000    28.89906      43.30000    25.42112
6      5       49.30000     25.00689      51.10000    27.85059       48.40000     23.16223      43.00000    24.33562      47.60000    32.13928
11     6       60.40000     23.61826      52.10000    29.74503       38.60000     34.39703      53.00000    23.28567      52.40000    20.88700
16     7       50.00000     30.76073      62.70000    23.98634       51.70000     32.10763      52.80000    32.27589      49.50000    23.00845
Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
1
> means <- with( df, tapply(cost, list(month, name), FUN=mean) )
> sds <- with( df, tapply(cost, list(month, name), FUN=sd) )
> colnames(means) <- paste0(colnames(means), ".mean")
> colnames(sds) <- paste0(colnames(sds), ".sd")
> comb.df <- as.data.frame( cbind(means, sds) )
> comb.df <- comb.df[order(names(comb.df))]
> comb.df
  jill.mean jill.mean.sd max.mean max.mean.sd mike.mean mike.mean.sd
4      62.1     22.29823     39.7    25.53016      39.6     30.11164
5      40.7     30.72838     44.4    29.12502      54.2     23.91095
6      47.3     31.54556     46.9    32.30910      65.3     30.05569
7      55.5     33.16038     45.9    28.13637      59.7     31.79815
  sam.mean sam.mean.sd tim.mean tim.mean.sd
4     40.9    23.54877     58.5    21.69613
5     51.5    30.76163     34.2    32.16900
6     69.1    18.26016     55.2    32.99764
7     46.9    29.90150     55.8    27.17352
IRTFM
  • 258,963
  • 21
  • 364
  • 487
0

You can use two reshape and then merge the results

library(reshape2)
> dcast(df, month ~ name, mean, value.var="cost")
  month jill  max mike  sam  tim
1     4 39.5 54.6 45.6 48.4 57.4
2     5 45.1 61.7 45.4 54.5 50.8
3     6 41.9 45.7 56.4 43.1 52.1
4     7 51.6 38.6 43.6 65.1 51.5

> dcast(df, month ~ name, sd, value.var="cost")
  month     jill      max     mike      sam      tim
1     4 29.31154 25.25954 28.96051 31.32695 29.82989
2     5 31.02848 27.96049 34.32589 30.08599 23.95273
3     6 32.09517 32.50316 37.16988 27.03681 30.42094
4     7 19.56300 31.50026 28.65969 36.53750 26.73429
Michele
  • 8,563
  • 6
  • 45
  • 72