1
dat <- data.frame(year = c(1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988),
             a = c(100, 200, 300, 400, 500, 350, 350, 350, 430),
             p = c(60, 90, 80, 75, 95, 600, 600, 600, 330),
             y = c(1, 1, 12, 11, 12,  13, 13, 13, 20))

if y is duplicated, then only keep the y with the highest a. If duplicated y have the same a, then keep y with the highest p. If duplicated y have the same a and p, then simply remove them. I did this:

dat %>% dplyr::arrange(y, a, p) %>% dplyr::group_by(y) %>% 
dplyr::filter(a == max(a)) %>% dplyr::filter(p == max(p)) %>%
distinct(a, p, y)

But how do I retain my year column here?

89_Simple
  • 3,393
  • 3
  • 39
  • 94

1 Answers1

1

As already mentioned in the comments, you can specify .keep_all = TRUE in distinct.

You are selecting your rows based on several conditions which you could combine in less lines of codes with arrange and slice:

library(dplyr)

dat %>% 
  group_by(y) %>% 
  arrange(a, p) %>% 
  slice(n())

Or:

dat %>% 
  group_by(y) %>% 
  arrange(desc(a), desc(p)) %>% 
  slice(1)

Further if you'd like to select the highest value given a weight you could use top_n:

dat %>% 
  group_by(y) %>% 
  top_n(1, a) %>%
  top_n(1, p) %>% 
  distinct(y, a, p, .keep_all = TRUE)

The last two approaches will keep the year 1985 for y = 13, wheras the first approach keeps the year 1987. You might want to specify which year (max or min) you want to keep for duplicated values.

kath
  • 7,624
  • 17
  • 32