0

I have a large dataframe where multiple rows are repeated measurements for a single ID. I want to return the rows with the maximum value of a column for each individual. Essentially performing a group.by() function as per SQL.

Dataframe (for illustrative purposes)

ID lac pO2
M1 1   80
M1 4   80
M2 2   70
M2 3   70
M3 3   75
M3 5   75

I want to call max(lac) and return the following results.

ID lac pO2
M1 4   80
M2 3   70
M3 5   75

I've had a look around and thought that the by() function might be useful, but haven't had any joy (code below).

newdf <- by(df, df$ID, max(df$lac))

Error in FUN(X[[1L]], ...) : could not find function "FUN"

I also looked at tapply but this doesn't work because I'm using a dataframe rather than a vector.

newdf <- tapply(df, df$ID, max)

Error: "arguments must have same length"

I've looked at similar answers, but these haven't helped. I'd appreciate some input from people more experienced than I!

Edit

Having dug a little deeper I've uncovered this question which suggests the plyr package might be useful.

Community
  • 1
  • 1
s_boardman
  • 416
  • 3
  • 9
  • 27
  • By and tapply work differently, first argument needs to be in the right format for the function: `tapply(df$lac, df$ID, max)` – Señor O Aug 11 '14 at 14:54

5 Answers5

2

Try this:

> by(mtcars, mtcars$cyl, max)
mtcars$cyl: 4
[1] 146.7
--------------------------------------------------------------------------------------- 
mtcars$cyl: 6
[1] 258
--------------------------------------------------------------------------------------- 
mtcars$cyl: 8
[1] 472

Alternatively use plyr:

> require(plyr)
Loading required package: plyr
> ddply(mtcars, .(cyl), max)
  cyl    V1
1   4 146.7
2   6 258.0
3   8 472.0
landroni
  • 2,902
  • 1
  • 32
  • 39
  • Thanks @landroni, I managed to work it out simultaneously to you posting your solution by doing something similar. :) – s_boardman Aug 11 '14 at 14:23
2

For large data set try data.table (assuming df is your data set)

library(data.table)
setDT(df)[, .SD[which.max(lac)], by = ID]

##    ID lac pO2
## 1: M1   4  80
## 2: M2   3  70
## 3: M3   5  75
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
1

Found a solution using plyr as discussed in update.

Code used was:

max_lac <- ddply(.data=df, .variables=.(ID), function(x) 
+ x[which(x$lac == max(x$lac)), ])
s_boardman
  • 416
  • 3
  • 9
  • 27
1

Here's a dplyr alternative in case you're processing large data sets:

library(dplyr)

df %>% group_by(ID) %>% filter(lac == max(lac))

#Source: local data frame [3 x 3]
#Groups: ID
#
#  ID lac pO2
#1 M1   4  80
#2 M2   3  70
#3 M3   5  75

Note that in case of multiple rows with maximas in the same group of ID, this function will return all rows containing the maximum value in lac, whereas functions using which.max(.) will only return the first row containing the maximum (per group).

If you only want to return the first max per group, you can use for example:

df %>% group_by(ID) %>% filter(1:n() == which.max(lac))

or

df %>% group_by(ID) %>% filter(lac == max(lac)) %>% do(head(.,1))
talat
  • 68,970
  • 21
  • 126
  • 157
  • I'm intentionally didn't post this so you won't be mad at me :) – David Arenburg Aug 11 '14 at 15:00
  • @DavidArenburg, awesome, you're a real team player ;) – talat Aug 11 '14 at 15:01
  • 1
    You also may want to add here that the OP needs to `detach` `plyr` (as it seems like he already using it) – David Arenburg Aug 11 '14 at 15:36
  • 1
    @DavidArenburg, good point. I can't test right now but I'm guessing that for these functions there wouldn't be an immediate problem with plyr loaded. Also, it's possible to have bother packages loaded if dplyr is loaded after plyr as far as I know (but probably not recommended) – talat Aug 11 '14 at 15:49
  • Thanks @beginneR & @DavidAreburg! I hadn't considered multiple maximal rows for each ID, good point. That may be useful in the future if this pilot study is successful. :) – s_boardman Aug 12 '14 at 09:03
0

(if you look at ddply and which.max) Can it be?:

ddply(df,.ID,function(x){x[which.max(x$lac)),]}