3

I have a big data frame (+239k observations on 57 variables) with some sickness descriptions and medicines administered to those sicknesses for people in different age ranges. I'd like to find those medicines in the top quartile of frequency use for each sickness description.

To make a reproducible example, I created a 1000 observations data frame:

set.seed(1);sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T));md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")));age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)

I can produce a table of frequencies with

xt<-xtabs(~md+sk+age,df)

I can then produce a data frame for each age group

XTDF_a<-as.data.frame(xt[,,"group a"])

and then find the 3rd quartile of frequencies of each sickness with:

Q3_a<-apply(XTDF_a,2,function(x) quantile(x,probs = .75))

to which I can compare and obtain which Medicines are over the 3rd quartile for each Sickness

XTDF_a>Q3_a


    sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3   TRUE   TRUE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE   TRUE
  med 5  FALSE  FALSE  FALSE  FALSE

I can conclude that med 3 is the top selection for Sickness A, and so on (I'm acutally looping to extract that information). I then go back and repeat the process for group b, c.... which is almost impossible with the size of data I have (sicknesses are about 4200 levels and medicines are about 1150 levels).

I'm pretty sure there should be a different, easier way to achieve this. I'd appreciate a hint on a better path to follow.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • I'm a little confused by what you are attempting to do. When you say "over the 3rd quartile" what do you mean? Is the question essentially "for all prescriptions of this drug X, which illnesses are most commonly being treated with X?" – TARehman May 19 '15 at 17:25
  • @TARehman: what I want to find is "for any given sickness, what are the most commonly prescribred drugs?" Thus, when I said "over 3rd Q" (in frequencies) I meant those drugs that were most likely prescribed to treat a given sickness. – PavoDive May 19 '15 at 17:53
  • I think you have a methodology problem because you are using quartiles. What if the medications used for a sickness are equally distributed? Perhaps that's not an issue for your data. Anyway, I suggest you turn your table around then and put sickness as row and meds as columns, it's conceptually more clear. – TARehman May 19 '15 at 17:54
  • I appreciate your input. The requirements established that I find "the leader drugs" for reported conditions, not the other way around. That's why I organized the data in this way. – PavoDive May 19 '15 at 17:59
  • Makes sense. I have written a way you could do this using `aggregate` below. – TARehman May 19 '15 at 18:09

2 Answers2

2

apply can work on 3-d arrays, and you can specify multiple dimensions to iterate over:

> apply(xt,2:3,function(x) x > quantile(x, probs = .75))
, , age = group a

       sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3   TRUE   TRUE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE   TRUE
  med 5  FALSE  FALSE  FALSE  FALSE

, , age = group b

       sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE   TRUE  FALSE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3  FALSE  FALSE  FALSE  FALSE
  med 4   TRUE  FALSE  FALSE  FALSE
  med 5  FALSE   TRUE  FALSE  FALSE

, , age = group c

       sk
md      sick A sick B sick C sick D
  med 1  FALSE  FALSE  FALSE   TRUE
  med 2  FALSE  FALSE  FALSE  FALSE
  med 3  FALSE  FALSE  FALSE  FALSE
  med 4  FALSE  FALSE  FALSE  FALSE
  med 5   TRUE  FALSE  FALSE  FALSE
Neal Fultz
  • 9,282
  • 1
  • 39
  • 60
1

I think you can speed this up by writing a bit more precise function and then using aggregate to get the results. You could also use by if you want a more list-based approach, which might be more useful for your next use. I think it will still be slow, but not as slow as looping.

# Here is what you gave me originally
set.seed(1)
sk<-as.factor(sample(c("sick A","sick B","sick C","sick D"),1000,replace=T))
md<-as.factor(sample(c("med 1","med 2","med 3","med 4","med 5")))
age<-as.factor(sample(c("group a","group b","group c"),1000,replace=T))
df<-data.frame(obs=1:1000,md=md,sk=sk,age=age)

# Define a function that basically does what you did before, but uses table()
func.get_75th_meds <- function(vector_of_meds) {

    freq <- table(vector_of_meds)
    return(names(freq)[freq >= quantile(x = freq,probs = 0.75)])
}

aggregate(x = list(Meds = df$md),
          by = list(Sickness = df$sk,Group = df$age),
          FUN = func.get_75th_meds)

   Sickness   Group                       Meds
1    sick A group a               med 3, med 5
2    sick B group a               med 3, med 5
3    sick C group a med 1, med 2, med 4, med 5
4    sick D group a               med 2, med 4
5    sick A group b               med 4, med 5
6    sick B group b        med 1, med 2, med 5
7    sick C group b               med 1, med 2
8    sick D group b               med 2, med 3
9    sick A group c               med 2, med 5
10   sick B group c               med 2, med 4
11   sick C group c        med 1, med 2, med 4
12   sick D group c        med 1, med 3, med 4

EDITED TO ADD: Here's the alternative with by() using the same function.

by(data = df$md,
   INDICES = list(Sickness = df$sk,Group = df$age),
   FUN = func.get_75th_meds)

Sickness: sick A
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
Sickness: sick B
Group: group a
[1] "med 3" "med 5"
---------------------------------------------------------------
... and so on
TARehman
  • 6,659
  • 3
  • 33
  • 60