How to use Dplyr's Summarize and which() to lookup min/max values

Question

I have the following data:

Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") 
data <- data.frame(Name, Age, Group)

And I'd like to use dplyr to

(1) group the data by "Group" (2) show the min and max Age within each Group (3) show the Name of the person with the min and max ages

The following code does this:

data %>% group_by(Group) %>%
     summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))], 
               maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])

Which works well:

  Group minAge minAgeName maxAge maxAgeName
1     A     22        Sam     22        Sam
2     B     12      Sarah     58      James
3     C     17     Andrew     82      Sally
4     D     12     Mairin     67        Ray

However, I have a problem if there are multiple min or max values:

Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D") 
data <- data.frame(Name, Age, Group)

> data %>% group_by(Group) %>%
+   summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))], 
+             maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value

I'm looking for two solutions:

(1) where it doesn't matter which min or max name is shown, just that one is shown (i.e., the first value found) (2) where if there are "ties" all minimum values and maximum values are shown

Please let me know if this isn't clear and thanks in advance!

Using `data.table`. `setDT(data)[, c(.SD[which.min(Age)], .SD[which.max(Age)]), Group]` and change the names accordingly — akrun, May 12 '15 at 16:31
@akrun one `.SD`: `setDT(data)[, .SD[c(which.min(Age),which.max(Age))], Group]`. Similarly, the selecting rows thing works here again: `setDT(data)[data[, .I[c(which.min(Age),which.max(Age))], Group]$V1]`. Still, that's only the first half of the question (in case someone's making an answer of it). — Frank, May 12 '15 at 16:36
@Frank That was a good one using `c`, but your solution is in the long form isn't it. — akrun, May 12 '15 at 16:37
@Frank For the tie case perhaps `setDT(data)[, c(.SD[Age==min(Age)], .SD[Age==max(Age)]) , Group]` (though I am not sure what the OP expects) — akrun, May 12 '15 at 16:47
@akrun I've put your double-`.SD` thing into an answer below. Suggestions welcome — Frank, May 12 '15 at 17:22

score 36 · Accepted Answer · answered May 12 '15 at 16:27

36

You can use which.min and which.max to get the first value.

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = Name[which.min(Age)], 
            maxAge = max(Age), maxAgeName = Name[which.max(Age)])

To get all values, use e.g. paste with an appropriate collapse argument.

data %>% group_by(Group) %>%
  summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "), 
            maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))

answered May 12 '15 at 16:27

shadow

21,823
4
63
77

1

Thanks again -- this is exactly what I was looking for as well. I wasn't clear about the wide format I wanted the data in. This works well for the "test" data I posted, but in an example where you had 100 ties (as an extreme example), collapsing everything into a single row can become unwieldy. But that's my fault for not explaining, not yours -- this solution works very well. – dreww2 May 12 '15 at 17:54

A5C1D2H2I1M1N2O1R2T1 · Answer 2 · 2015-05-12T16:40:20.657

14

I would actually recommend keeping your data in a "long" format. Here's how I would approach this:

library(dplyr)

Keeping all values when there are ties:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%  ## optional
  filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3    Jim  31     B
# 4  James  58     B
# 5 Andrew  17     C
# 6  Sally  82     C
# 7 Mairin  12     D
# 8    Ray  67     D

Keeping only one value when there are ties:

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
# 
#     Name Age Group
# 1    Sam  22     A
# 2  Sarah  31     B
# 3  James  58     B
# 4 Andrew  17     C
# 5  Sally  82     C
# 6 Mairin  12     D
# 7    Ray  67     D

If you really want a "wide" dataset, the basic concept would be to gather and spread the data, using "tidyr":

library(dplyr)
library(tidyr)

data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(c(1, n())) %>%
  mutate(minmax = c("min", "max")) %>%
  gather(var, val, Name:Age) %>%
  unite(key, minmax, var) %>%
  spread(key, val)
# Source: local data frame [4 x 5]
# 
#   Group max_Age max_Name min_Age min_Name
# 1     A      22      Sam      22      Sam
# 2     B      58    James      31    Sarah
# 3     C      82    Sally      17   Andrew
# 4     D      67      Ray      12   Mairin

Though what wide form you would want with ties is unclear.

edited May 12 '15 at 16:40

answered May 12 '15 at 16:33

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

Somewhat harder to read (and maybe to work with) than one row per group with max and min columns as in the OP, though. – Frank May 12 '15 at 16:35
@Frank, how would you deal with multiple cases? Pasting together seems harder to work with to me. – A5C1D2H2I1M1N2O1R2T1 May 12 '15 at 16:36
Now that I think about it, for the "keep ties" case, yeah, it's much better long, and for consistency, might as well do them both that way. Maybe I'd remove `if (length(Age)==1) 1 else` and add a categorical var `minormax=1` if min `=2` if max. – Frank May 12 '15 at 16:40
1

@Frank, way ahead of you :-) – A5C1D2H2I1M1N2O1R2T1 May 12 '15 at 16:41
Ah I see :) Well, I meant I'd do that even for the long form (keep max and min even if they're the same; keep the `minmax` column). – Frank May 12 '15 at 16:44
I think you know what I mean, but I've added the long-form output I mean in an answer. – Frank May 12 '15 at 17:20
1

@Frank, the minmax idea is already embedded in my last answer. Just stop at the `mutate` line. When there are multiples, you can use `dense_rank` and `factor`. – A5C1D2H2I1M1N2O1R2T1 May 12 '15 at 17:26
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/77638/discussion-between-frank-and-ananda-mahto). – Frank May 12 '15 at 17:30
1

Many thanks! I particularly liked the simplicity of the filter %in% range solution. – dreww2 May 12 '15 at 17:52

Frank · Answer 3 · 2015-05-12T17:44:11.003

Here are some data.table approaches, the first one borrowed from @akrun:

setDT(data)

# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
   # Group min.Name min.Age max.Name max.Age
# 1:     A      Sam      22      Sam      22
# 2:     B    Sarah      31    James      58
# 3:     C   Andrew      17    Sally      82
# 4:     D   Mairin      12      Ray      67

# show all, long format
data[,{
  mina=min(Age)
  maxa=max(Age)
  rbind(
    data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
    data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
   # Group minmax Age   Name
# 1:     A    min  22    Sam
# 2:     A    max  22    Sam
# 3:     B    min  31  Sarah
# 4:     B    min  31    Jim
# 5:     B    max  58  James
# 6:     C    min  17 Andrew
# 7:     C    max  82  Sally
# 8:     D    min  12 Mairin
# 9:     D    max  67    Ray

I think the long format is the best, since it allows you to filter with minmax, but the code is tortured and inefficient.

Here are some arguably less good ways:

# show all, wide format (with a list column)
data[,{
  mina=min(Age)
  maxa=max(Age)
  list(
    minAge=mina,
    maxAge=maxa,
    minNames=list(Name[Age==mina]),
    maxNames=list(Name[Age==maxa]))
},by=Group]
   # Group minAge maxAge  minNames maxNames
# 1:     A     22     22       Sam      Sam
# 2:     B     31     58 Sarah,Jim    James
# 3:     C     17     82    Andrew    Sally
# 4:     D     12     67    Mairin      Ray


# show all, wide format (with a string column)
# (just look at @shadow's answer)

For the second set of code, did you used the 2nd example data ( I get 8 rows as output) — akrun, May 12 '15 at 17:38
@akrun Good catch. I've switched to my original, blatantly inefficient code that does produce the nine rows. The only alternative I can think of is the equally inefficient choice of putting the `ifelse` earlier in the code. I think that `dplyr` wins this round, since producing the result in my second table isn't so hard there (as Ananda explained in comments). — Frank, May 12 '15 at 17:45
Very nice, thank you! I'm not using data.table in this instance because performance isn't critical, but I appreciate the post. — dreww2, May 12 '15 at 17:52
Note to self/would-be editors: the ugly code in the second block can be cleaned up with the `frank()` function. — Frank, May 12 '15 at 18:36

How to use Dplyr's Summarize and which() to lookup min/max values

3 Answers3

Linked

Related