22

I have the following data frame called surge:

MeshID StormID       Rate Surge Wind
     1    1412 1.0000E-01  0.01  0.0
     2    1412 1.0000E-01  0.03  0.0
     3    1412 1.0000E-01  0.09  0.0
     4    1412 1.0000E-01  0.12  0.0
     5    1412 1.0000E-01  0.02  0.0
     6    1412 1.0000E-01  0.02  0.0
     7    1412 1.0000E-01  0.07  0.0
     1    1413 1.0000E-01  0.06  0.0
     2    1413 1.0000E-01  0.02  0.0
     3    1413 1.0000E-01  0.05  0.0

I used the following code to find the max value of surge per storm:

MaxSurge <- data.frame(tapply(surge[,4], surge[,2], max))

It returns:

1412 0.12
1413 0.06

This is great, except I'd also like it to include the MeshID value at the point where the surge is the maximum. I know I can probably use which.max, but I can't quite figure out how to put this in action. I'm VERY new to R programming.

Arun
kimmyjo221

4 Answers

14

And a data.table solution, for coding elegance:

library(data.table)
surge <- as.data.table(surge)
# The column is named Surge (capital S); lowercase surge is the table itself
surge[, .SD[which.max(Surge)], by = StormID]
mnel
13

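Here is another data.table solution that does not rely on .SD (thus 10x faster). It assumes surge has already been converted with as.data.table():

# Rank within each storm, largest Surge first; ties share rank 1
surge[, grp.ranks := rank(-Surge, ties.method = "min"), by = StormID]
surge[grp.ranks == 1]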
Andro Selva
massyah
  • +1 Very nice! When `.I` is added, it'll be easier (and even faster I hope): `surge[ surge[,.I[which.max(surge)],by=StormID,drop=TRUE]]`. That's a bit ugly though so we could auto optimize the `.SD` approach to do that under the hood, to retain the elegance of mnel's answer. So just to note that it is true as you rightly say that `.SD` should be avoided if possible, currently, because it creates the entire subset which might not be needed. But this will hopefully not be true in future. One of the reasons it's all inside `[...]` is so `data.table` can optimize things like this in future. – Matt Dowle Oct 18 '12 at 13:14
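Since this comment was written, `.I` has become available in data.table, so the row-index idea can be written directly. The following is a minimal sketch, assuming a current data.table release and the question's column names. The sample values are copied from the question, and `idx` and `max_rows` are names chosen here for illustration.

```r
library(data.table)

# Sample rows copied from the question (Rate and Wind omitted for brevity)
surge <- data.table(
  MeshID  = c(1:7, 1:3),
  StormID = c(rep(1412, 7), rep(1413, 3)),
  Surge   = c(0.01, 0.03, 0.09, 0.12, 0.02, 0.02, 0.07, 0.06, 0.02, 0.05)
)

# .I holds row numbers of the full table; which.max picks one per storm
idx <- surge[, .I[which.max(Surge)], by = StormID]$V1

# Subset the original table by those row numbers
max_rows <- surge[idx]
print(max_rows)
```

Because only row numbers are computed per group, this avoids building the full `.SD` subset for every storm. Like `which.max` in general, it keeps only the first row when the maximum is tied.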
7

If two or more data points share the maximum, which.max returns only the first one. A more complete solution uses rank:

# data with a tie for max
surge <- data.frame(MeshID = c(1:7, 1:4),
                    StormID = c(rep(1412, 7), rep(1413, 4)),
                    Surge = c(0.01, 0.03, 0.09, 0.12, 0.02, 0.02, 0.07, 0.06, 0.02, 0.05, 0.06))

# compute ranks (negated so that the largest Surge gets rank 1)
surge$rank <- ave(-surge$Surge, surge$StormID, FUN = function(x) rank(x, ties.method = "min"))
# subset on the rank
subset(surge, rank == 1)
   MeshID StormID Surge rank
4       4    1412  0.12    1
8       1    1413  0.06    1
11      4    1413  0.06    1
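To see the tie behaviour directly, and as an alternative to ranking, each value can be compared with its group maximum. This is a small base R sketch reusing the tied data above; `is_max` and `tied_rows` are names chosen here for illustration. The equality test is safe in this case because `max` returns one of the stored values unchanged.

```r
# Same tied data as above
surge <- data.frame(
  MeshID  = c(1:7, 1:4),
  StormID = c(rep(1412, 7), rep(1413, 4)),
  Surge   = c(0.01, 0.03, 0.09, 0.12, 0.02, 0.02, 0.07, 0.06, 0.02, 0.05, 0.06)
)

x <- surge$Surge[surge$StormID == 1413]
which.max(x)        # position of the first maximum only
which(x == max(x))  # positions of every maximum

# ave() returns the group maximum aligned with each row
is_max <- surge$Surge == ave(surge$Surge, surge$StormID, FUN = max)
tied_rows <- surge[is_max, ]
print(tied_rows)
```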
James
6

Here's a plyr solution, just because someone will say it if I don't...

R> library(plyr)
R> ddply(surge, "StormID", function(x) x[which.max(x$Surge),])
  MeshID StormID Rate Surge Wind
1      4    1412  0.1  0.12    0
2      1    1413  0.1  0.06    0
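For comparison, the same split-apply-combine step can be sketched in base R. The point to watch is that `which.max` returns a position within the group, so it has to index the group's own rows rather than the full data frame. This assumes the question's column names; `per_storm`, `pos`, and `wrong` are names chosen here for illustration.

```r
# Sample rows copied from the question (Rate and Wind omitted for brevity)
surge <- data.frame(
  MeshID  = c(1:7, 1:3),
  StormID = c(rep(1412, 7), rep(1413, 3)),
  Surge   = c(0.01, 0.03, 0.09, 0.12, 0.02, 0.02, 0.07, 0.06, 0.02, 0.05)
)

# Split into one data frame per storm, keep the max row of each, recombine
per_storm <- do.call(
  rbind,
  lapply(split(surge, surge$StormID), function(d) d[which.max(d$Surge), ])
)
print(per_storm)

# Pitfall: within-group positions applied to the whole data frame
pos <- as.vector(tapply(surge$Surge, surge$StormID, which.max))
wrong <- surge[pos, ]  # rows 4 and 1 of the full table, both from storm 1412
print(wrong)
```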
Joshua Ulrich
  • The two methods seem to have given different results. The `ddply` version works, because inside the function you are indexing a subset of `x`. In the `tapply` version `which.max` returns the index of the maximum in the subset but uses it to index the whole of `x`. – seancarmody Oct 03 '12 at 11:53
  • Can I ask a further question? If I wanted to count the number of times the max is repeated for a particular stormID, how would I do that? At this point it is just picking the first instance of MeshID for which Surge is a max, correct? What if the max occurs more than once? Thank you. – kimmyjo221 Oct 04 '12 at 15:03
  • Perfect! Sorry one more question. What if I'm really only interested in those cases where surge > .10? – kimmyjo221 Oct 04 '12 at 15:34
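Both follow-up questions can be sketched in base R under the same column-name assumptions. The first part counts how often each storm's maximum occurs; the second keeps only the maximum rows whose surge exceeds 0.10. The tied sample data and the names `n_at_max`, `is_max`, and `big` are illustrative, not taken from the thread.

```r
# Illustrative data with a tie for the maximum in storm 1413
surge <- data.frame(
  MeshID  = c(1:7, 1:4),
  StormID = c(rep(1412, 7), rep(1413, 4)),
  Surge   = c(0.01, 0.03, 0.09, 0.12, 0.02, 0.02, 0.07, 0.06, 0.02, 0.05, 0.06)
)

# Number of rows per storm that equal that storm's maximum
n_at_max <- tapply(surge$Surge, surge$StormID, function(x) sum(x == max(x)))
print(n_at_max)

# Rows at the per-storm maximum, restricted to maxima above 0.10
is_max <- surge$Surge == ave(surge$Surge, surge$StormID, FUN = max)
big <- surge[is_max & surge$Surge > 0.10, ]
print(big)
```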