Summarising levels of multiple factor variables

Question

I have searched for similar questions, but cannot find the exact solution required. This question is somewhat similar, but only deals with the issue of summarising multiple continuous variables, not factors.

I have a dataframe consisting of 4 factor variables (sex, agegroup, hiv, group), e.g.

set.seed(20150710)
df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=T))), 
           agegroup=as.factor(c(sample(1:5,10000, replace=T))),
           hiv=as.factor(c(sample(1:3,10000, replace=T))),
           group=as.factor(c(sample(1:2,10000, replace=T)))
           )

levels(df$sex)<- c("Male", "Female")
levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+")
levels(df$hiv)<-c("Positive", "Negative", "Not tested")
levels(df$group)<-c("Intervention", "Control")

I would like to create a summary table giving counts and proportions for each level of the exposure variables sex, agegroup and hiv, stratified by group.

EDIT: this is what I am aiming for:

                X N_Control Percent_Control N_Intervention     Percent_Intervention
1      sex_Female      2517       0.5041057           2480            0.4953066
2        sex_Male      2476       0.4958943           2527            0.5046934
3  agegroup_16-24      1005       0.2012818            992            0.1981226
4  agegroup_25-34      1001       0.2004807            996            0.1989215
5  agegroup_35-44      1010       0.2022832            997            0.1991212
6  agegroup_45-54       976       0.1954737            996            0.1989215
7    agegroup_55+      1001       0.2004807           1026            0.2049131
8    hiv_Negative      1679       0.3362708           1642            0.3279409
9  hiv_Not tested      1633       0.3270579           1660            0.3315359
10   hiv_Positive      1681       0.3366713           1705            0.3405233

But I cannot get it to work with summarise_each in dplyr; only overall variable counts and proportions, and not for each factor level, are given:

df.out<-df %>%
  group_by(group) %>%
  summarise_each(funs(N=n(), Percent=n()/sum(n())), sex, agegroup, hiv)
print(df.out)

group sex_N agegroup_N hiv_N sex_Percent agegroup_Percent hiv_Percent
1     1  4973       4973  4973           1                1           1
2     2  5027       5027  5027           1                1           1

Finally, is there some way to reshape the table (e.g. using tidyr), so that the exposure variables (sex, agegroup, hiv) are reported as rows?

Thanks

Something like `ftable(group ~ sex + hiv + agegroup, data = df)` to start with? — Roman Luštrik, Jul 15 '15 at 09:43

Jaap · Accepted Answer · 2016-02-03T17:34:04.463

Doing it in two steps will give you the desired result. First, calculate the n, then calculate the percentage by group:

library(dplyr)
df.out <- df %>%
  group_by(group, sex, agegroup, hiv) %>%
  tally() %>%
  group_by(group) %>%
  mutate(percent=n/sum(n))

A solution with data.table:

library(data.table)
dt.out <- setDT(df)[, .N, by=.(group, sex, agegroup, hiv)][, percent:=N/sum(N), by=group]

library(microbenchmark)
microbenchmark(df.out = df %>%
                 group_by(group, sex, agegroup, hiv) %>%
                 tally() %>%
                 group_by(group) %>%
                 mutate(percent=n/sum(n)),
               dt.out = df[,.N,by=.(group, sex, agegroup, hiv)][,percent:=N/sum(N),by=group])

# Unit: milliseconds
#   expr      min       lq     mean   median       uq       max neval cld
# df.out 8.299870 8.518590 8.894504 8.708315 8.931459 11.964930   100   b
# dt.out 2.346632 2.394788 2.540132 2.441777 2.551235  4.344442   100  a

Conclusion: the data.table solution is much faster (3.5x).

To get a table like you requested after the edit of your question, you can do the following:

library(data.table)

setDT(df)
dt.sex <- dcast(df[,.N, by=.(sex,group)][,percent:=N/sum(N)], sex ~ group, value.var = c("N", "percent"))
dt.age <- dcast(df[,.N, by=.(agegroup,group)][,percent:=N/sum(N)], agegroup ~ group, value.var = c("N", "percent"))
dt.hiv <- dcast(df[,.N, by=.(hiv,group)][,percent:=N/sum(N)], hiv ~ group, value.var = c("N", "percent"))

dt.out.wide <- rbindlist(list(dt.sex, dt.age, dt.hiv), use.names=FALSE)
names(dt.out.wide) <- c("X","N_Intervention","N_Control","percent_Intervention","percent_Control")

this gives:

> dt.out.wide
             X N_Intervention N_Control percent_Intervention percent_Control
 1:       Male           2454      2488               0.2454          0.2488
 2:     Female           2561      2497               0.2561          0.2497
 3:      16-24            954       991               0.0954          0.0991
 4:      25-34           1033      1002               0.1033          0.1002
 5:      35-44           1051      1000               0.1051          0.1000
 6:      45-54            983       978               0.0983          0.0978
 7:        55+            994      1014               0.0994          0.1014
 8:   Positive           1717      1664               0.1717          0.1664
 9:   Negative           1637      1659               0.1637          0.1659
10: Not tested           1661      1662               0.1661          0.1662

Many thanks - that is wonderful! May not have been completely clear with what I meant by "reshape". Have edited the question to show what I am aiming for - not sure if this is possible using dplyr/tidyr? — Peter MacPherson, Jul 15 '15 at 10:31
@PeterMacPherson Looking at the required output in your updated question, this seems odd to me. You mixing several variables into one, which is not a wise thing to do IMHO. Could you explain why you want it in this format? — Jaap, Jul 15 '15 at 10:52
No further analysis would be done on this table. Is purely to allow efficient export in my workflow for reporting study results in the standard biomedical table of baseline characteristics format. E.g. see for example Table 1 in this manuscript [link](http://www.nejm.org/doi/full/10.1056/NEJMoa1011205#t=articleResults) — Peter MacPherson, Jul 15 '15 at 11:06

Summarising levels of multiple factor variables

1 Answers1