I have searched for similar questions, but cannot find the exact solution required. This question is somewhat similar, but only deals with the issue of summarising multiple continuous variables, not factors.
I have a dataframe consisting of 4 factor variables (sex
, agegroup
, hiv
, group
), e.g.
set.seed(20150710)
df<-data.frame(sex=as.factor(c(sample(1:2, 10000, replace=T))),
agegroup=as.factor(c(sample(1:5,10000, replace=T))),
hiv=as.factor(c(sample(1:3,10000, replace=T))),
group=as.factor(c(sample(1:2,10000, replace=T)))
)
levels(df$sex)<- c("Male", "Female")
levels(df$agegroup)<- c("16-24", "25-34", "35-44", "45-54", "55+")
levels(df$hiv)<-c("Positive", "Negative", "Not tested")
levels(df$group)<-c("Intervention", "Control")
I would like to create a summary table giving counts and proportions for each level of the exposure variables sex
, agegroup
and hiv
, stratified by group
.
EDIT: this is what I am aiming for:
X N_Control Percent_Control N_Intervention Percent_Intervention
1 sex_Female 2517 0.5041057 2480 0.4953066
2 sex_Male 2476 0.4958943 2527 0.5046934
3 agegroup_16-24 1005 0.2012818 992 0.1981226
4 agegroup_25-34 1001 0.2004807 996 0.1989215
5 agegroup_35-44 1010 0.2022832 997 0.1991212
6 agegroup_45-54 976 0.1954737 996 0.1989215
7 agegroup_55+ 1001 0.2004807 1026 0.2049131
8 hiv_Negative 1679 0.3362708 1642 0.3279409
9 hiv_Not tested 1633 0.3270579 1660 0.3315359
10 hiv_Positive 1681 0.3366713 1705 0.3405233
But I cannot get it to work with summarise_each
in dplyr; only overall variable counts and proportions, and not for each factor level, are given:
df.out<-df %>%
group_by(group) %>%
summarise_each(funs(N=n(), Percent=n()/sum(n())), sex, agegroup, hiv)
print(df.out)
group sex_N agegroup_N hiv_N sex_Percent agegroup_Percent hiv_Percent
1 1 4973 4973 4973 1 1 1
2 2 5027 5027 5027 1 1 1
Finally, is there some way to reshape the table (e.g. using tidyr), so that the exposure variables (sex, agegroup, hiv) are reported as rows?
Thanks