Select rows by ID with most matches

Question

I have a data frame like this:

df <- data.frame(id = c(1,1,1,2,2,3,3,3,3,4,4,4),
                 torre = c("a","a","b","d","a","q","t","q","g","a","b","c"))

and I would like my code to select for each id the torre that repeats more, or the last torre for the id if there isnt one that repeats more than the other, so ill get a new data frame like this:

df2 <- data.frame(id = c(1,2,3,4), torre = c("a","a","q","c"))

Karolis Koncevičius · Accepted Answer · 2018-08-21T19:01:23.273

You can use aggregate:

aggregate(torre ~ id, data=df,
  FUN=function(x) names(tail(sort(table(factor(x, levels=unique(x)))),1))
)

The full explanation for this function is a bit involved, but most of the job is done by the FUN= parameter. In this case we are making a function that get's the frequency counts for each torre, sorts them in increasing order, then get's the last one with tail(, 1) and takes the name of it. aggregate() function then applies this function separately for each id.

score 1 · Answer 2 · answered Aug 21 '18 at 19:00

You could do this using the dplyr package: group by id and torre to calculate the number of occurrences of each torre/id combination, then group by id only and select the last occurrence of torre that has the highest in-group frequency.

library(dplyr)
df %>% 
group_by(id,torre) %>% 
mutate(n=n()) %>% 
group_by(id) %>% 
filter(n==max(n)) %>%
slice(n()) %>% 
select(-n)
     id torre
  <dbl> <chr>
1     1     a
2     2     a
3     3     q
4     4     c

score 1 · Answer 3 · answered Aug 21 '18 at 19:21

An approach with the data.table package:

library(data.table)
setDT(df)[, .N, by = .(id, torre)][order(N), .(torre = torre[.N]), by = id]

which gives:

   id torre
1:  1     a
2:  2     a
3:  3     q
4:  4     c

And two possible dplyr alternatives:

library(dplyr)

# option 1
df %>% 
  group_by(id, torre) %>% 
  mutate(n = n()) %>% 
  group_by(id) %>% 
  mutate(f = rank(n, ties.method = "first")) %>% 
  filter(f == max(f)) %>% 
  select(-n, -f)

# option 2
df %>% 
  group_by(id, torre) %>% 
  mutate(n = n()) %>% 
  distinct() %>% 
  arrange(n) %>% 
  group_by(id) %>% 
  slice(n()) %>% 
  select(-n)

tmfmnk · Answer 4 · 2019-07-26T07:51:24.250

1

Yet another dplyr solution, this time using add_count() instead of mutate():

df %>%
  add_count(id, torre) %>% 
  group_by(id) %>% 
  filter(n == max(n)) %>% 
  slice(n()) %>% 
  select(-n)

# A tibble: 4 x 2
# Groups:   id [4]
     id torre
  <dbl> <fct>
1    1. a    
2    2. a    
3    3. q    
4    4. c

edited Jul 26 '19 at 07:51

answered Aug 21 '18 at 19:38

tmfmnk

38,881
4
47
67

Select rows by ID with most matches

4 Answers4