Performing functions on grouped rows in an R dataframe

Question

I have a large dataframe where multiple rows are repeated measurements for a single ID. I want to return the rows with the maximum value of a column for each individual. Essentially performing a group.by() function as per SQL.

Dataframe (for illustrative purposes)

I want to call max(lac) and return the following results.

ID lac pO2
M1 4   80
M2 3   70
M3 5   75

I've had a look around and thought that the by() function might be useful, but haven't had any joy (code below).

newdf <- by(df, df$ID, max(df$lac))

Error in FUN(X[[1L]], ...) : could not find function "FUN"

I also looked at tapply but this doesn't work because I'm using a dataframe rather than a vector.

newdf <- tapply(df, df$ID, max)

Error: "arguments must have same length"

I've looked at similar answers, but these haven't helped. I'd appreciate some input from people more experienced than I!

Edit

Having dug a little deeper I've uncovered this question which suggests the plyr package might be useful.

By and tapply work differently, first argument needs to be in the right format for the function: `tapply(df$lac, df$ID, max)` — Señor O, Aug 11 '14 at 14:54

score 2 · Answer 1 · answered Aug 11 '14 at 14:17

2

Try this:

> by(mtcars, mtcars$cyl, max)
mtcars$cyl: 4
[1] 146.7
--------------------------------------------------------------------------------------- 
mtcars$cyl: 6
[1] 258
--------------------------------------------------------------------------------------- 
mtcars$cyl: 8
[1] 472

Alternatively use plyr:

> require(plyr)
Loading required package: plyr
> ddply(mtcars, .(cyl), max)
  cyl    V1
1   4 146.7
2   6 258.0
3   8 472.0

answered Aug 11 '14 at 14:17

landroni

2,902
1
32
39

Thanks @landroni, I managed to work it out simultaneously to you posting your solution by doing something similar. :) – s_boardman Aug 11 '14 at 14:23

score 2 · Answer 2 · answered Aug 11 '14 at 14:22

2

For large data set try data.table (assuming df is your data set)

library(data.table)
setDT(df)[, .SD[which.max(lac)], by = ID]

##    ID lac pO2
## 1: M1   4  80
## 2: M2   3  70
## 3: M3   5  75

answered Aug 11 '14 at 14:22

David Arenburg

91,361
17
137
196

score 1 · Accepted Answer · answered Aug 11 '14 at 14:21

1

Found a solution using plyr as discussed in update.

Code used was:

max_lac <- ddply(.data=df, .variables=.(ID), function(x) 
+ x[which(x$lac == max(x$lac)), ])

answered Aug 11 '14 at 14:21

s_boardman

416
3
9
27

talat · Answer 4 · 2014-08-11T14:59:04.183

1

Here's a dplyr alternative in case you're processing large data sets:

library(dplyr)

df %>% group_by(ID) %>% filter(lac == max(lac))

#Source: local data frame [3 x 3]
#Groups: ID
#
#  ID lac pO2
#1 M1   4  80
#2 M2   3  70
#3 M3   5  75

Note that in case of multiple rows with maximas in the same group of ID, this function will return all rows containing the maximum value in lac, whereas functions using which.max(.) will only return the first row containing the maximum (per group).

If you only want to return the first max per group, you can use for example:

df %>% group_by(ID) %>% filter(1:n() == which.max(lac))

or

df %>% group_by(ID) %>% filter(lac == max(lac)) %>% do(head(.,1))

edited Aug 11 '14 at 14:59

answered Aug 11 '14 at 14:49

talat

68,970
21
126
157

I'm intentionally didn't post this so you won't be mad at me :) – David Arenburg Aug 11 '14 at 15:00
@DavidArenburg, awesome, you're a real team player ;) – talat Aug 11 '14 at 15:01
1

You also may want to add here that the OP needs to `detach` `plyr` (as it seems like he already using it) – David Arenburg Aug 11 '14 at 15:36
1

@DavidArenburg, good point. I can't test right now but I'm guessing that for these functions there wouldn't be an immediate problem with plyr loaded. Also, it's possible to have bother packages loaded if dplyr is loaded after plyr as far as I know (but probably not recommended) – talat Aug 11 '14 at 15:49
Thanks @beginneR & @DavidAreburg! I hadn't considered multiple maximal rows for each ID, good point. That may be useful in the future if this pilot study is successful. :) – s_boardman Aug 12 '14 at 09:03

score 0 · Answer 5 · answered Aug 11 '14 at 14:24

0

(if you look at ddply and which.max) Can it be?:

ddply(df,.ID,function(x){x[which.max(x$lac)),]}

answered Aug 11 '14 at 14:24

user3929561

41
1

Performing functions on grouped rows in an R dataframe

5 Answers5