4

I have a data frame (my_data) and want to calculate the sum of only the 3 highest values even though there might be ties. I am quite new to R and I've used dplyr.

A tibble: 15 x 3
   city      month number
   <chr>     <chr>  <dbl>
 1 Lund      jan       12
 2 Lund      feb       12
 3 Lund      mar       18
 4 Lund      apr       28
 5 Lund      may       28
 6 Stockholm jan       15
 7 Stockholm feb       15
 8 Stockholm mar       30
 9 Stockholm apr       30
10 Stockholm may       10
11 Uppsala   jan       22
12 Uppsala   feb       30
13 Uppsala   mar       40
14 Uppsala   apr       60
15 Uppsala   may       30

This is the code I have tried:

# For each city, count the top 3 of variable number
my_data %>% group_by(city) %>% top_n(3, number) %>% summarise(top_nr = sum(number))

The expected (wanted) output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          86
2 Stockholm     75
3 Uppsala      130

but the actual R output is:

# A tibble: 3 x 2
  city      top_nr
  <chr>      <dbl>
1 Lund          86
2 Stockholm     90
3 Uppsala      160

It seems like if there are ties, all tied values are included in the summation. I wanted only 3 unique instances with highest values to be counted.

Any help would be much appreciated! :)

NelsonGon
  • 13,015
  • 7
  • 27
  • 57

3 Answers3

6

We can do a distinct to remove the duplicate elements. The way in which top_n works is that if the values are duplicated, it will keep that many dupe rows

my_data %>% 
   distinct(city, number, .keep_all = TRUE) %>%
   group_by(city) %>%
   top_n(3, number) %>%
   summarise(top_nr = sum(number))

Update

Based on the OP's new output, after the top_n output (which is not arranged), get the 'number' arranged in descending order and get the sum of first 3 'number'

my_data %>% 
   group_by(city) %>% 
   top_n(3, number) %>% 
   arrange(city,  desc(number)) %>% 
   summarise(number = sum(head(number, 3)))
# A tibble: 3 x 2
#  city      number
#  <chr>      <int>
#1 Lund          74
#2 Stockholm     75
#3 Uppsala      130

data

my_data <- structure(list(city = c("Lund", "Lund", "Lund", "Lund", "Lund", 
"Stockholm", "Stockholm", "Stockholm", "Stockholm", "Stockholm", 
"Uppsala", "Uppsala", "Uppsala", "Uppsala", "Uppsala"), month = c("jan", 
"feb", "mar", "apr", "may", "jan", "feb", "mar", "apr", "may", 
"jan", "feb", "mar", "apr", "may"), number = c(12L, 12L, 18L, 
28L, 28L, 15L, 15L, 30L, 30L, 10L, 22L, 30L, 40L, 60L, 30L)), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
"14", "15"))
akrun
  • 874,273
  • 37
  • 540
  • 662
5

Life might be way simpler without top_n():

dat %>%
  group_by(city) %>%
  summarize(
    top_nr = sum(tail(sort(number), 3))
    )
utubun
  • 4,400
  • 1
  • 14
  • 17
  • Wow! Very succinct! But as an R-amateur I find the part : sum(tail(sort(number), 3)), quite hard to understand (even thoug it sloved the problem necely). – Andrés Lagerlöf May 11 '19 at 18:54
  • Starting from the middle it `sort`s the `number` in *ascending order*; after that `tail` returns *last 3* numbers from the previous result, and pass those numbers into the `sum` function, which adds them together. – utubun May 11 '19 at 19:53
  • Nice! Thanks! :) – Andrés Lagerlöf May 12 '19 at 08:27
2

This tidyverse (actually, dplyr) solution is almost equal to akrun's, but filters the dataframe instead of getting the top_n.

library(tidyverse)

my_data %>%
  group_by(city) %>%
  arrange(desc(number), .by_group = TRUE) %>%
  filter(row_number() %in% 1:3) %>%
  summarise(top_nr = sum(number))
## A tibble: 3 x 2
#  city      top_nr
#  <chr>      <int>
#1 Lund          74
#2 Stockholm     75
#3 Uppsala      130
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66