columns removed while removing duplicated rows in dplyr

Question

dat <- data.frame(year = c(1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988),
             a = c(100, 200, 300, 400, 500, 350, 350, 350, 430),
             p = c(60, 90, 80, 75, 95, 600, 600, 600, 330),
             y = c(1, 1, 12, 11, 12,  13, 13, 13, 20))

if y is duplicated, then only keep the y with the highest a. If duplicated y have the same a, then keep y with the highest p. If duplicated y have the same a and p, then simply remove them. I did this:

dat %>% dplyr::arrange(y, a, p) %>% dplyr::group_by(y) %>% 
dplyr::filter(a == max(a)) %>% dplyr::filter(p == max(p)) %>%
distinct(a, p, y)

But how do I retain my year column here?

add `.keep_all = TRUE` to `distinct`. – missuse Oct 02 '18 at 09:49 — missuse, Oct 02 '18 at 09:49

kath · Accepted Answer · 2018-10-02T11:37:00.283

1

As already mentioned in the comments, you can specify .keep_all = TRUE in distinct.

You are selecting your rows based on several conditions which you could combine in less lines of codes with arrange and slice:

library(dplyr)

dat %>% 
  group_by(y) %>% 
  arrange(a, p) %>% 
  slice(n())

Or:

dat %>% 
  group_by(y) %>% 
  arrange(desc(a), desc(p)) %>% 
  slice(1)

Further if you'd like to select the highest value given a weight you could use top_n:

dat %>% 
  group_by(y) %>% 
  top_n(1, a) %>%
  top_n(1, p) %>% 
  distinct(y, a, p, .keep_all = TRUE)

The last two approaches will keep the year 1985 for y = 13, wheras the first approach keeps the year 1987. You might want to specify which year (max or min) you want to keep for duplicated values.

edited Oct 02 '18 at 11:37

answered Oct 02 '18 at 11:16

kath

7,624
17
32

It is better to use `TRUE` instead off `T`, [see also here](https://stackoverflow.com/q/18256639/2204410). – Jaap Oct 02 '18 at 11:35
1

Thanks! I wasn't aware of that. – kath Oct 02 '18 at 11:36

columns removed while removing duplicated rows in dplyr

1 Answers1