How to summarize the top 3 highest values in a dataset when there are ties

Question

I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.

# A tibble: 15 x 3
   city      month number
   <chr>     <chr>  <dbl>
 1 Lund      jan       12
 2 Lund      feb       12
 3 Lund      mar       18
 4 Lund      apr       28
 5 Lund      may       28
 6 Stockholm jan       15
 7 Stockholm feb       15
 8 Stockholm mar       30
 9 Stockholm apr       30
10 Stockholm may       10
11 Uppsala   jan       22
12 Uppsala   feb       30
13 Uppsala   mar       40
14 Uppsala   apr       60
15 Uppsala   may       30

This is the code I have tried:

# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))

The expected (wanted) output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          74
2 Stockholm     75
3 Uppsala      130

but the actual R output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          74
2 Stockholm     90
3 Uppsala      160

It seems that when there are ties, all tied values are included in the sum. I want only the 3 rows with the highest values to be counted for each city, even if some of those values are equal.

Any help would be much appreciated! :)

Your example and the output numbers shown seem to be different. Is the output from a different dataset? – akrun, May 11 '19 at 17:14
Please use `dput` to provide your data so it is easier to reproduce the problem. — NelsonGon, May 11 '19 at 17:33
@NelsonGon I didn't know about the dput function. I will use it in the future. — Andrés Lagerlöf, May 12 '19 at 08:37

score 6 · Answer 1 · akrun · 2019-05-11T17:43:58.303

We can use distinct to remove the duplicated values first. top_n keeps every row that ties with the n-th highest value, so when values are duplicated it can return more than n rows per group.

my_data %>% 
   distinct(city, number, .keep_all = TRUE) %>%
   group_by(city) %>%
   top_n(3, number) %>%
   summarise(top_nr = sum(number))
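
For reference, a minimal sketch of what this first version computes. It assumes the same values as `my_data` in this answer, rebuilt inline so the snippet runs on its own; `distinct_top3` is only an illustrative name. Because `distinct` collapses repeated values within a city, the result is the sum of the three highest *distinct* values, which is not the same as the sum of the three highest rows:

```r
library(dplyr)

# Same values as my_data in this answer, rebuilt inline for a standalone run
my_data <- data.frame(
  city   = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month  = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12L, 12L, 18L, 28L, 28L,
             15L, 15L, 30L, 30L, 10L,
             22L, 30L, 40L, 60L, 30L)
)

distinct_top3 <- my_data %>%
  distinct(city, number, .keep_all = TRUE) %>%  # one row per city/value pair
  group_by(city) %>%
  top_n(3, number) %>%                          # no ties remain after distinct()
  summarise(top_nr = sum(number))

distinct_top3$top_nr
# Lund: 28 + 18 + 12 = 58, Stockholm: 30 + 15 + 10 = 55, Uppsala: 60 + 40 + 30 = 130
```

For Stockholm this gives 55 rather than 75, since the second 30 has been dropped before summing.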

Update

Based on the OP's new expected output, the duplicates should not be discarded; only the first 3 rows per city should count. The top_n output is not sorted, so arrange 'number' in descending order within each city and then sum the first 3 values.

my_data %>% 
   group_by(city) %>% 
   top_n(3, number) %>% 
   arrange(city,  desc(number)) %>% 
   summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
#  city      number
#  <chr>      <int>
#1 Lund          74
#2 Stockholm     75
#3 Uppsala      130
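
In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`, whose `with_ties` argument controls the tie behaviour directly. A sketch of the same calculation, assuming a dplyr version that provides `slice_max()` and the same values as `my_data` (rebuilt inline; `top3_sum` is only an illustrative name):

```r
library(dplyr)

# Same values as my_data in this answer, rebuilt inline for a standalone run
my_data <- data.frame(
  city   = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month  = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12L, 12L, 18L, 28L, 28L,
             15L, 15L, 30L, 30L, 10L,
             22L, 30L, 40L, 60L, 30L)
)

top3_sum <- my_data %>%
  group_by(city) %>%
  slice_max(number, n = 3, with_ties = FALSE) %>%  # exactly 3 rows per city
  summarise(top_nr = sum(number))

top3_sum$top_nr
# 74 75 130
```

With `with_ties = TRUE` (the default), `slice_max()` keeps tied rows in the same way `top_n()` does, so the argument has to be set explicitly here.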

data

my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund", 
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm", 
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan", 
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may", 
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L, 
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))

edited May 11 '19 at 17:43

answered May 11 '19 at 17:10

akrun

Thanks @akrun but when I run your suggested code I get – Andrés Lagerlöf May 11 '19 at 17:16
Why use `head` here? I tried something similar and got weird results. – NelsonGon May 11 '19 at 17:34
@NelsonGon Based on the new output, the OP is not discarding the dupes, but taking the first 3 only, top_n gives as many rows as there are dupes – akrun May 11 '19 at 17:35
Ooh, I see. Thanks for the explanation. – NelsonGon May 11 '19 at 17:37
@akrun nice solution! Thanks! – Andrés Lagerlöf May 11 '19 at 18:21

score 5 · Accepted Answer · answered May 11 '19 at 18:21

Life might be way simpler without top_n():

my_data %>%
  group_by(city) %>%
  summarize(
    # sort ascending, keep the last (largest) 3 values, then add them up
    top_nr = sum(tail(sort(number), 3))
  )

answered May 11 '19 at 18:21

utubun

Wow! Very succinct! But as an R amateur I find the part sum(tail(sort(number), 3)) quite hard to understand (even though it solved the problem nicely). – Andrés Lagerlöf May 11 '19 at 18:54
Starting from the middle, it `sort`s the `number` values in *ascending order*; after that, `tail` returns the *last 3* numbers from the previous result and passes them to the `sum` function, which adds them together. – utubun May 11 '19 at 19:53
Nice! Thanks! :) – Andrés Lagerlöf May 12 '19 at 08:27
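
The sort / tail / sum steps described in the comment above can be traced on a single city. A small base R sketch using the Stockholm values from the question (the intermediate names `sorted`, `last3` and `total` are only illustrative):

```r
# number values for Stockholm, in the original row order
number <- c(15L, 15L, 30L, 30L, 10L)

sorted <- sort(number)      # ascending order: 10 15 15 30 30
last3  <- tail(sorted, 3)   # last three elements: 15 30 30
total  <- sum(last3)        # 15 + 30 + 30 = 75

total
# 75
```

Because `tail` always returns at most three elements, a tie at the third position cannot pull extra values into the sum.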

score 2 · Answer 3 · answered May 11 '19 at 17:54

This tidyverse (actually, dplyr) solution is almost the same as akrun's, but it filters the data frame by row position within each city instead of using top_n.

library(tidyverse)

my_data %>%
  group_by(city) %>%
  arrange(desc(number), .by_group = TRUE) %>%
  filter(row_number() %in% 1:3) %>%
  summarise(top_nr = sum(number))
## A tibble: 3 x 2
#  city      top_nr
#  <chr>      <int>
#1 Lund          74
#2 Stockholm     75
#3 Uppsala      130
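
To see why this version never returns more than three rows per city, one can inspect what survives the `filter` step. A small sketch under the same assumptions (data rebuilt inline from the question's values; `kept` is only an illustrative name):

```r
library(dplyr)

# Same values as my_data in the question, rebuilt inline for a standalone run
my_data <- data.frame(
  city   = rep(c("Lund", "Stockholm", "Uppsala"), each = 5),
  month  = rep(c("jan", "feb", "mar", "apr", "may"), times = 3),
  number = c(12L, 12L, 18L, 28L, 28L,
             15L, 15L, 30L, 30L, 10L,
             22L, 30L, 40L, 60L, 30L)
)

kept <- my_data %>%
  group_by(city) %>%
  arrange(desc(number), .by_group = TRUE) %>%  # largest values first within each city
  filter(row_number() %in% 1:3) %>%            # keep positions 1 to 3, whatever the values
  ungroup()

count(kept, city)$n
# 3 3 3
kept$number
# 28 28 18 30 30 15 60 40 30
```

The filter works on row positions rather than on values, so a tie at the third position (the two 15s in Stockholm, the two 30s in Uppsala) contributes only one row.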
