So I am trying to find the max, second max, and third max water use per customer ID per year for a dataset. I'm using R and the sqldf library, but am open to any R solutions to this. Here's a bit of sample data:
Year | ID | Month | Use |
----------------------------
2009 101 1 103
2009 101 2 209
2009 101 3 375
2009 101 4 360
2010 101 1 170
2010 101 2 381
2010 101 3 275
2010 101 4 260
2009 102 1 263
2009 102 2 234
2009 102 3 45
2009 102 4 275
2010 102 1 469
2010 102 2 107
2010 102 3 354
2010 102 4 436
Ideally I would want to return three matrices, max1
, max2
, max3
with columns ID
, Year
, Max
(or second max or third max, respectively)
So max1 = [101, 2009, 375, 101, 2010, 381, 102, 2009, 275, 102, 2010, 469]
etc.
My initial approach was to make a nested for loop with listofIDs
and listofYears
as the domains of ID
and Year
, like:
for i in 1:length(listofIDs){
for y in 1:length(listofYears){
monthlylist<-sqldf("select Month, Use from Dataframe where ID=listofIDs[i] and Year=listofYears[y]")
and then sort monthlylist
and pull out the max's, etc.
But sqldf
won't read variables like that so I would have to explicitly state where ID = 101
, where ID = 102
each time.
Any ideas on how to get sqldf to recognize my varibles, or a better way to find the max, second max, and third max aggregated by year and ID? I am working with big datasets so ideally something that doesn't take forever.