I have a big data frame (+239k observations on 57 variables) with some sickness descriptions and medicines administered to those sicknesses for people in different age ranges. I'd like to find those medicines in the top quartile of frequency use for each sickness description.
To make a reproducible example, I created a 1000 observations data frame:
set.seed(1);sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T));md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")));age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)
I can produce a table of frequencies with
xt<-xtabs(~md+sk+age,df)
I can then produce a data frame for each age group
XTDF_a<-as.data.frame(xt[,,"group a"])
and then find the 3rd quartile of frequencies of each sickness with:
Q3_a<-apply(XTDF_a,2,function(x) quantile(x,probs = .75))
to which I can compare and obtain which Medicines are over the 3rd quartile for each Sickness
XTDF_a>Q3_a
sk
md sick A sick B sick C sick D
med 1 FALSE FALSE TRUE FALSE
med 2 FALSE FALSE FALSE FALSE
med 3 TRUE TRUE FALSE FALSE
med 4 FALSE FALSE FALSE TRUE
med 5 FALSE FALSE FALSE FALSE
I can conclude that med 3
is the top selection for Sickness A, and so on (I'm acutally looping to extract that information). I then go back and repeat the process for group b, c.... which is almost impossible with the size of data I have (sicknesses are about 4200 levels and medicines are about 1150 levels).
I'm pretty sure there should be a different, easier way to achieve this. I'd appreciate a hint on a better path to follow.