1

Various R functions make it easy to use group_by and summarize to pull a value from a grouped variable. So in the resulting dataframe, I can use group_by and summarise to create, for example, a new column that contains the maximum or minimum value of a variable within each group. Meaning, with this data:

name, value
foo, 100
foo, 200
foo, 300
bar, 400
bar, 500
bar, 600

I can easily get the max or min for each value of name:

group_by(name) %>% summarize(maxValue = max(value)

But suppose I want the second ranked value for each name? Meaning suppose I want my result to be

name maxValue midValue
foo 300 200
bar 600 500

In other words, how do I fill in the blank in this:

df %>% group_by(name) %>% 
summarize(maxValue = max(value), 
  secondValue = _________)

Thanks, from an r newbie, for any help!

oneyellowlion
  • 195
  • 1
  • 11
  • Is the last value of `name` supposed to be `Ba4` or is that a typo? And the case of the names is inconsistent comparing your input data to the desired result. Also: the middle value in this case is the second, but do you always want the second, or the middle? – neilfws Nov 15 '22 at 21:14
  • @diomedesdata: Thanks, but that's not really getting my what I'm looking for. That post deals with ordering the results. My questions goes one step further: how do I pull out the second ranked value after ordering. – oneyellowlion Nov 15 '22 at 22:19
  • @neilfws: 1) yep, typo, sorry. 2) Yeah, case of names should be the same. 3) I want second, not always middle. Thanks for any help. First answer in looks like it's going to work -- need to fool with it. – oneyellowlion Nov 15 '22 at 22:20
  • @oneyellowlion reiterating Juan C's comment - what is your expected output in the case of a tie (e.g. when there are multiple rows with the maxium value)? – diomedesdata Nov 16 '22 at 12:57

4 Answers4

2
library(dplyr)

df %>% 
  group_by(name) %>% 
  arrange(value) %>% 
  summarise(maxValue = max(value), 
            midValue = value[2])

Result:

# A tibble: 2 × 3
  name  maxValue midValue
  <chr>    <int>    <int>
1 Bar        600      500
2 Foo        300      200
neilfws
  • 32,751
  • 5
  • 50
  • 63
1

This should do:

df %>% group_by(name) %>% arrange(desc(value)) %>% slice(2)

Code:

a = 'name value
Foo 100
Foo 200
Foo 300
Bar 400
Bar 500
Bar 600'


df = read.table(text = a, header = T)
df %>% group_by(name) %>% arrange(desc(value)) %>% slice(2)

Output:

# A tibble: 2 × 2
# Groups:   name [2]
  name  value
  <fct> <int>
1 Bar     500
2 Foo     200
Juan C
  • 5,846
  • 2
  • 17
  • 51
  • Thanks! We're close, but I realize my sample output isn't actually exactly what I'm looking for. I actually need just the value in the output, not the whole row. So, to be more specific using your example, imagine that for each name, I need the max value and the second highest value. I think what I'm asking is, how do I fill in the blank in this: ``` df %>% group_by(name) %>% arrange(desc(value)) %>% summarize(maxValue = max(value), secondValue = _______________) ``` – oneyellowlion Nov 15 '22 at 22:55
  • Does this work for you?df %>% group_by(name) %>% arrange(desc(value)) %>% slice(2)%>% select(value) ? – Eva Nov 16 '22 at 00:07
  • What's your expected output when there's ties i.e. if the maximum value and the second maximum value are the same? – Juan C Nov 16 '22 at 12:05
1

We can pull out the 2 maximum values per group, and select the second largest one.

df %>% 
  group_by(name) %>% 
  summarise(maxvalue = sortN(value, 1),
            midvalue = sortN(value, 2)[2])
# A tibble: 2 × 3
  name  maxvalue midvalue
  <chr>    <int>    <int>
1 Bar        600      500
2 Foo        300      200

Note that the code above will handle ties (e.g. if two rows in the group bar had the value 600 it will look for another value*). This is not important in your example but could be important with other data.

The function sortN(x, n, type = "max") is defined below. It pulls out the n max/min values of x. This is heavily based on the post in my comment. It is not necessary to define a whole function for this problem (as the other answers show), but I find this function useful for a range of problems, so it's nice to have.

sortN <- function(x, n, type = "max") {

  # GR 1 - If type is not "max" or "min", error
  if (! type %in% c("max", "min")) {
    stop("type must be max or min.")
  }

  # GR 2 - If n >= length(unique(x)), return whole vector
  if (n >= length(unique(x))){
    return(unique(x))
  }

  # Change based on whether the user wants min or max
  type <- switch(type, min = FALSE, max = TRUE)

  if (type) {
  x <- unique(x)
  partial <- length(x) - n + 1
  out <- x[x >= sort(x, partial = partial)[partial]]
  sort(out, decreasing = TRUE)
  } else {
  out <- -sortN(x = -x, n = n, type = "max")
  sort(out, decreasing = FALSE)
  }
}

* Though as is, the code will error if there are not at least two distinct values in the group. Up to you to decide whether that is important. In any case, it could be solved easily with a small ifelse statement around the existing code.

diomedesdata
  • 995
  • 1
  • 6
  • 15
0
library(dplyr)

df <- data.frame(
  name = c("Foo", "Foo", "Foo", "Bar", "Bar", "Bar"),
  value = c(100, 200, 300, 400, 500, 600)
)

df %>% 
  group_by(name) %>% 
  summarize(secondValue = sort(value, decreasing = TRUE)[2])

Santiago
  • 641
  • 3
  • 14