1

I want to count the number of rows per person (hai_dispense_number) per month (month) in the following dataframe. My overall aim is to see if the mean count of rows increases from April to September. I'm pretty sure I should use the ave function to make a count variable. But all my attempts aren't working for me. See attempts below. Once I have the counts made, I think I'll be able to use ddply to do a mean summary per month. below is a toy df with the column 'obs' my desired output.

df
         hai_dispense_number date_of_claim hai_atc     month obs
9972511   Patient HAI0002664    2010-04-07 A10BA02     april   1
11376245  Patient HAI0002664    2010-05-04 A10BA02       may   1
12508505  Patient HAI0002664    2010-05-31 A10BA02       may   2
13480611  Patient HAI0002664    2010-06-30 A10BA02      june   1
13486327  Patient HAI0002664    2010-06-30 A10BH03      june   2
13567944  Patient HAI0002664    2010-06-08 A10BA02      june   3
15003657  Patient HAI0002664    2010-07-27 A10BA02      july   1
15003658  Patient HAI0002664    2010-07-27 A10BH03      july   2
16600413  Patient HAI0002664    2010-08-31 A10BB09    august   1
16600866  Patient HAI0002664    2010-08-23 A10BA02    august   2
16600867  Patient HAI0002664    2010-08-23 A10BH03    august   3
17537505  Patient HAI0002664    2010-08-27 A10BB09    august   4
19176349  Patient HAI0002664    2010-09-17 A10BB09 september   1
19176350  Patient HAI0002664    2010-09-17 A10BH03 september   2
19176358  Patient HAI0002664    2010-09-17 A10BA02 september   3
17765433  Patient HAI0006637    2010-09-17 A10BA02 september   4
12953451  Patient HAI0007418    2010-06-04 A10BA02      june   1
15786889  Patient HAI0007418    2010-07-28 A10BB09      july   1
15787103  Patient HAI0007418    2010-07-12 A10BB09      july   2
15787233  Patient HAI0007418    2010-07-05 A10BA02      july   3
15878776  Patient HAI0007418    2010-07-08 A10BB09      july   4
15908690  Patient HAI0007418    2010-07-23 A10BB09      july   5
17363576  Patient HAI0007418    2010-08-20 A10BB09    august   1
17554737  Patient HAI0007418    2010-08-13 A10BB09    august   2

Prior attempts

df$obs<-with(df, ave(month, hai_dispense_number, FUN=seq_along))  ##doesn't split by month

df$obs<-with(df, ave(month, hai_dispense_number, FUN=cumsum))  ##gives all NA values, think seq_along is actually what I want

df$obs <- ave(df$month, df$month, FUN=seq_along)  ##this is better than the previous two, but doesn't seem to split by person

ddply(df,~month,summarise,mean=mean(obs)) ##this works absolutely fine, just need to counts right first!

Would value any input anyone could offer me. Seems like I'm getting something fundamental wrong here.

ekad
  • 14,436
  • 26
  • 44
  • 46
user2363642
  • 727
  • 9
  • 26
  • 2
    If you 'only' need to count each combination of factor levels you may try `table`. Could this be something: `with(df, data.frame(table(hai_dispense_number, month)))`? – Henrik Jan 29 '14 at 12:55
  • @Henrik nice and simple solution! Thank you! – user2363642 Jan 29 '14 at 13:32

1 Answers1

2

Okay, I have trimmed down your data to this:

> head(df)
            patient month
9972511  HAI0002664 april
11376245 HAI0002664   may
12508505 HAI0002664   may
13480611 HAI0002664  june
13486327 HAI0002664  june
13567944 HAI0002664  june

That's all we need since we are working with patient identifiers and months only. To get the required new column, try this:

library(plyr)

> ddply(df, .(patient, month), mutate, obs = 1:length(month))
      patient     month obs
1  HAI0002664     april   1
2  HAI0002664    august   1
3  HAI0002664    august   2
4  HAI0002664    august   3
5  HAI0002664    august   4
6  HAI0002664      july   1
7  HAI0002664      july   2
8  HAI0002664      june   1
9  HAI0002664      june   2
10 HAI0002664      june   3
11 HAI0002664       may   1
12 HAI0002664       may   2
13 HAI0002664 september   1
14 HAI0002664 september   2
15 HAI0002664 september   3
16 HAI0006637 september   1
17 HAI0007418    august   1
18 HAI0007418    august   2
19 HAI0007418      july   1
20 HAI0007418      july   2
21 HAI0007418      july   3
22 HAI0007418      july   4
23 HAI0007418      july   5
24 HAI0007418      june   1

Incidentally, I assume that in your sample output the obs = 4 for September is a type since the patient identifier has changed from the previous three (2664 to 6637).

datawookie
  • 1,607
  • 12
  • 20