I am using ddply in R to split the data two different ways, and I also want subtotals at each level. This is the code I am using:

require(plyr)
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
 mean = round(mean(age), 2),
 sd = round(sd(age), 2))

I also want the same (mean, sd) summary by group alone and for the entire data set. Is there a way to include these in the same ddply call?

megv
  • Please provide a reproducible example including `data`. – lukeA Jan 15 '15 at 22:03
  • Is it important that you get all the data in one call, enough that it's worth replicating the data as is done in the answer? What is the drawback about binding three grouped calls together? – Avraham Jan 15 '15 at 23:04

2 Answers

This is a dplyr suggestion rather than a plyr one. If I am not mistaken, you want the mean and sd for 1) group * sex, 2) group, and 3) the entire data set. If you do not want to make your data larger, you could try something like this.

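library(dplyr)

# One summary per level: group * sex, group, and the whole data set.
# 'age' is named explicitly in every call so all three summarise the same column.
bind_rows(summarise_each(group_by(dfx, group, sex), funs(mean, sd), age),
          summarise_each(group_by(dfx, group), funs(mean, sd), age),
          summarise_each(dfx, funs(mean, sd), age))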

Each of the three summarise_each calls summarises the data at one level. bind_rows, which is available in the dev version of dplyr (dplyr 0.4), then stacks the results. Grouping columns that are absent from a summary come out as NA, as in the output below; if you need to relabel those NA values, you can do that afterwards.

#   group sex     mean        sd
#1      A   F 40.81629  9.190859
#2      A   M 34.27423 10.408674
#3      B   F 28.94309  9.002275
#4      B   M 37.70992 11.606198
#5      C   F 41.36827  8.796248
#6      C   M 38.16745  8.912859
#7      A  NA 36.72750  9.874593
#8      B  NA 34.20319 11.210715
#9      C  NA 39.76786  8.111645
#10    NA  NA 36.05086 10.192498
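
`summarise_each()` and `funs()` have since been deprecated in dplyr. The following is a minimal sketch of the same three-level summary with plain `summarise()`, assuming dplyr 1.0 or later (needed for `.groups` and `across()`). The fixed seed and the "all" label that replaces `NA` are assumptions added for illustration, not part of the original answer.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only so the example is reproducible
dfx <- data.frame(
  group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54),
  stringsAsFactors = FALSE
)

# Level 1: one row per group * sex combination
by_group_sex <- dfx %>%
  group_by(group, sex) %>%
  summarise(mean = mean(age), sd = sd(age), .groups = "drop")

# Level 2: one row per group
by_group <- dfx %>%
  group_by(group) %>%
  summarise(mean = mean(age), sd = sd(age), .groups = "drop")

# Level 3: a single row for the whole data set
overall <- dfx %>%
  summarise(mean = mean(age), sd = sd(age))

# Stack the levels; bind_rows() fills the missing grouping columns with NA,
# which are then relabelled with the placeholder "all"
result <- bind_rows(by_group_sex, by_group, overall) %>%
  mutate(across(c(group, sex), ~ coalesce(.x, "all")))

print(result)
```

Keeping the three summaries as separate objects makes it easy to drop or add a level without touching the others.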
jazzurro

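You can replicate the data 4 times:

- including sex and group
- including sex
- including group
- not including any column

In each copy, the columns that are not included are overwritten with a constant placeholder such as "all", so that ddply treats them as a single level.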

require(plyr)
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

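# replicate the data, ignoring one or more attributes in each copy
# copy with both attributes collapsed: overall total
dfAll <- dfx
dfAll$group <- "all"
dfAll$sex <- "all"
# copy with group collapsed: totals by sex
dfGroup <- dfx
dfGroup$group <- "all_group"
# copy with sex collapsed: totals by group
dfSex <- dfx
dfSex$sex <- "all_sex"
dfToGroup <- rbind(dfx, dfGroup, dfSex, dfAll)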

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfToGroup, .(group, sex), summarize,
      mean = round(mean(age), 2),
      sd = round(sd(age), 2))
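
If replicating the data is not desirable, an alternative along the lines suggested in the comments on the question is to run one ddply call per level and stack the results. This is a minimal sketch, not part of the original answer: `age_stats` is a hypothetical helper name and the fixed seed is an assumption for reproducibility. `rbind.fill` from plyr fills the grouping columns missing from a level with NA.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed, only so the example is reproducible
dfx <- data.frame(
  group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Hypothetical helper: rounded mean and sd of age for one piece of the data
age_stats <- function(d) {
  data.frame(mean = round(mean(d$age), 2),
             sd = round(sd(d$age), 2))
}

by_group_sex <- ddply(dfx, .(group, sex), age_stats)  # group * sex
by_group     <- ddply(dfx, .(group), age_stats)       # group subtotals
overall      <- age_stats(dfx)                        # grand total

# Stack the three levels; absent grouping columns become NA
result <- rbind.fill(by_group_sex, by_group, overall)
print(result)
```

Because the statistics live in one helper, changing what is computed only requires editing `age_stats`.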
Michele Usuelli
  • Thanks! I was looking for an easier way to automate this across different types of data, but this will work in the meantime. – megv Jan 15 '15 at 22:48