subtotal with ddply in R

Question

I am using ddply in R to summarize data by two grouping variables, but I also want subtotals at each level. This is the code I am using:

require(plyr)
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,
 mean = round(mean(age), 2),
 sd = round(sd(age), 2))

I also want the same summary (mean, sd) by group alone and for the entire data set. Is there a way to include these in the same ddply call?

Is it important that you get all the data in one call, enough that it's worth replicating the data as is done in the answer? What is the drawback of binding three grouped calls together? — Avraham, Jan 15 '15 at 23:04

jazzurro · Answer 1 · 2015-01-16T02:28:58.247

This is a dplyr suggestion rather than a plyr one. If I am not mistaken, you want the mean and sd for 1) group * sex, 2) group, and 3) the entire data set. If you do not want to make your data larger, you could try something like this.

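library(dplyr)

# One summary per level, stacked into a single data frame:
# 1) group * sex, 2) group only, 3) the entire data set
bind_rows(summarise_each(group_by(dfx, group, sex), funs(mean, sd), age),
          summarise_each(group_by(dfx, group), funs(mean, sd), age),
          summarise_each(dfx, funs(mean, sd), age))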

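Each of the three summarise_each() calls summarises the data at one level: group and sex, group only, and the whole data set. bind_rows(), which is available in the dev version of dplyr (dplyr 0.4), then stacks the results and fills any grouping column that a level does not use with NA. If you need to replace the NA values, you can do that afterwards.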

#   group sex     mean        sd
#1      A   F 40.81629  9.190859
#2      A   M 34.27423 10.408674
#3      B   F 28.94309  9.002275
#4      B   M 37.70992 11.606198
#5      C   F 41.36827  8.796248
#6      C   M 38.16745  8.912859
#7      A  NA 36.72750  9.874593
#8      B  NA 34.20319 11.210715
#9      C  NA 39.76786  8.111645
#10    NA  NA 36.05086 10.192498
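In later dplyr releases summarise_each() and funs() are deprecated. As a rough sketch of the same three-level idea with plain summarise(), assuming dplyr 1.0 or later (for the .groups argument), something like the following should work. The helper name age_summary and the fixed seed are my own additions for illustration and are not part of the original answer.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dfx <- data.frame(
  group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Hypothetical helper: mean and sd of age for whatever grouping
# the incoming data frame already carries
age_summary <- function(data) {
  summarise(data, mean = mean(age), sd = sd(age), .groups = "drop")
}

subtotals <- bind_rows(
  age_summary(group_by(dfx, group, sex)),  # group * sex
  age_summary(group_by(dfx, group)),       # subtotal per group
  age_summary(dfx)                         # entire data set
)
subtotals
```

As in the output above, the grouping columns that a level does not use come back as NA, so the subtotal rows can be relabelled afterwards if needed.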

This is what I had in mind when I asked above; nice answer :) — Avraham, Jan 16 '15 at 02:28
@Avraham Thank you for your comment. I wanted to avoid making data large. Your comment gave me inspiration. :) — jazzurro, Jan 16 '15 at 02:32

score 1 · Accepted Answer · answered Jan 15 '15 at 22:37

You can replicate the data 4 times:
- including sex and group
- including sex
- including group
- not including any column

In each copy, the columns that are not included are overwritten with a constant label such as "all", so ddply treats them as a single group.

require(plyr)
dfx <- data.frame(
  group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

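# Replicate the data, overwriting one or more grouping columns with a
# constant label so that ddply treats that column as a single group
dfAll <- dfx                    # entire data set: group and sex collapsed
dfAll$group <- "all"
dfAll$sex <- "all"
dfGroup <- dfx                  # subtotal by sex: group collapsed
dfGroup$group <- "all_group"
dfSex <- dfx                    # subtotal by group: sex collapsed
dfSex$sex <- "all_sex"
dfToGroup <- rbind(dfx, dfGroup, dfSex, dfAll)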

# Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfToGroup, .(group, sex), summarize,
      mean = round(mean(age), 2),
      sd = round(sd(age), 2))
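If replicating the data is a concern, a plyr-only alternative is to run one summary per level and stack the results, as suggested in the comments on the question. The following is only a sketch: the helper name age_stats and the fixed seed are hypothetical additions, and rbind.fill() is used because the three results do not share the same columns.

```r
library(plyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
dfx <- data.frame(
  group = c(rep("A", 8), rep("B", 15), rep("C", 6)),
  sex = sample(c("M", "F"), size = 29, replace = TRUE),
  age = runif(n = 29, min = 18, max = 54)
)

# Hypothetical helper: the same mean/sd summary for any set of
# grouping variables, passed as a quoted list such as .(group, sex)
age_stats <- function(data, vars) {
  ddply(data, vars, summarize,
        mean = round(mean(age), 2),
        sd = round(sd(age), 2))
}

by_group_sex <- age_stats(dfx, .(group, sex))
by_group <- age_stats(dfx, .(group))
# Entire data set: no grouping, so summarize the data frame directly
overall <- summarize(dfx,
                     mean = round(mean(age), 2),
                     sd = round(sd(age), 2))

# rbind.fill() stacks the pieces and fills missing columns with NA
result <- rbind.fill(by_group_sex, by_group, overall)
result
```

Because the grouping variables are an argument, the same helper can be pointed at other columns, which may help when the summary has to be repeated for different data.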

Thanks! I was looking for an easier way to possibly automate this with different types of data, but it will work for the meantime — megv, Jan 15 '15 at 22:48
