R: subset/group data frame with a max value?

Question

Given a data frame like this:

  gid set  a  b
1   1   1  1  9
2   1   2 -2 -3
3   1   3  5  6
4   2   2 -4 -7
5   2   6  5 10
6   2   9  2  0

How can I subset/group data frame of a unique gid with the max set value and 1/0 wether its a value is greater than its b value?

So here, it'd be, uh...

1,3,0
2,9,1

Kind of a stupid simple thing in SQL but I'd like to have a bit better control over my R, so...

hrbrmstr · Accepted Answer · 2015-01-21T02:13:49.017

6

Piece of cake with dplyr:

dat <- read.table(text="gid set  a  b
1   1  1  9
1   2 -2 -3
1   3  5  6
2   2 -4 -7
2   6  5 10
2   9  2  0", header=TRUE)

library(dplyr)

dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater=a>b) %>%
  select(gid, set, greater)

## Source: local data frame [2 x 3]
## Groups: gid
## 
##   gid set greater
## 1   1   3   FALSE
## 2   2   9    TRUE

If you really need 1's and 0's and the dplyr groups cause any angst:

dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater=ifelse(a>b, 1, 0)) %>%
  select(gid, set, greater) %>%
  ungroup

## Source: local data frame [2 x 3]
## 
##   gid set greater
## 1   1   3       0
## 2   2   9       1

You could do the same thing without pipes:

ungroup(
  select(
    mutate(
      filter(row_number() == which.max(set)), 
      greater=ifelse(a>b, 1, 0)), gid, set, greater))

but…but… why?! :-)

edited Jan 21 '15 at 02:13

answered Jan 21 '15 at 02:00

hrbrmstr

77,368
11
139
205

It's a fairly popular "piping" idiom first started by the `magrittr` package, but used extensively in the "Hadleyverse" (i.e. `dplyr`, `ggvis`, `tidyr`, etc). Rather than convoluted, nested parenthesis, this idiom "pipes" data to functions for processing, similar to the way D3 javascript function chains work. – hrbrmstr Jan 21 '15 at 02:07
1

This might be problematic if there are ties for the max of set. `filter(row_number() == which.max(set))` might be safer – Rich Scriven Jan 21 '15 at 02:11
#ty @RichardScriven. oversight on my part. answer: updated. interestingly enough, that hasn't been an issue in the netflow data I've been processing of late. definitely an edge case to watch out for. – hrbrmstr Jan 21 '15 at 02:16
How might you sort the result by e.g. set number ascending? – Wells Jan 21 '15 at 05:51
Also, this `%>%` seems to max out at ten items, and then I get a line of `.. ... ...`, seemingly indicating more? Why the limitation and how would I avoid it? – Wells Jan 21 '15 at 05:53
A variation of the answer would be: `dat %>% group_by(gid) %>% slice(which.max(set)) %>% transmute(set, greater = as.integer(a>b))` – talat Jan 21 '15 at 07:03
@RichardScriven, re your comment on the filter. `mtcars %>% filter(row_number() == which.max(cyl))` will return only the first max and imo would be simpler with `mtcars %>% slice(which.max(cyl))`. If you wanted all max-values returned, you could use for example `mtcars %>% filter(cyl %in% max(cyl))` or `mtcars %>% filter(min_rank(desc(cyl)) == 1)`. I think some clarifying comments are missing so not sure what you meant exactly. – talat Jan 21 '15 at 07:46
@wells that's just `dplyr` printing sanely (you can add `%>% print(n=1000)` to the end of the chain for viewing. I'll update the answer to convert it to a normal data.frame and also add @docendo's alternate answer to the example. – hrbrmstr Jan 21 '15 at 11:06

Rich Scriven · Answer 2 · 2015-01-21T03:15:23.260

Here's a data.table possibility, assuming your original data is called df.

library(data.table)

setDT(df)[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
#    gid set b
# 1:   1   3 0
# 2:   2   9 1

Note that to account for multiple max(set) rows, I used set == max(set) as the subset so that this will return the same number of rows for which there are ties for the max (if that makes any sense at all).

And courtesy of @thelatemail, another data table option:

setDT(df)[, list(set = max(set), ab = (a > b)[which.max(set)] + 0), by = gid]
#    gid set ab
# 1:   1   3  0
# 2:   2   9  1

akrun · Answer 3 · 2015-01-21T04:49:05.290

1

In base R, you can use ave

indx <- with(df, ave(set, gid, FUN=max)==set)
#in cases of ties
#indx <- with(df, !!ave(set, gid, FUN=function(x) 
#                  which.max(x) ==seq_along(x)))


transform(df[indx,], greater=(a>b)+0)[,c(1:2,5)]
#   gid set greater
# 3   1   3       0
# 6   2   9       1

edited Jan 21 '15 at 04:49

answered Jan 21 '15 at 04:43

akrun

874,273
37
540
662

R: subset/group data frame with a max value?

3 Answers3