Get the most frequent value per row and account for ties

Question

Sample data:

df <- data.frame("ID" = 1:6, 
                 "Group1" = c("A", NA, "C", NA, "E", "C"), 
                 "Group2" = c("E", "C", "C", NA, "E", "E"),
                 "Group3" = c("A", "A", NA, NA, "C", NA),
                 "Group4" = c(NA, "C", NA, "D", "C", NA),
                 "Group5" = c("A", "D", NA, NA, NA, NA))

In each row, I want to count the number of each value and store the most frequent value in a new variable, New.Group. In case of ties, the first value in the row should be selected. The logic applied to the example:

Row 1 of New.Group takes value A because it is most frequent value in the row, ignoring NAs.

Row 2 takes value C because it is also the most frequent value.

Row 3 the same as Row 2.

Row 4 takes value D because it's the only value in the row.

In Row 5 both E and C has count 2, but E is selected because it is encountered before C in the row.

Row 6, similar to row 5, both C and E has count 1, but C is selected because it is encountered before E in the row.

The desired output:

  ID Group1 Group2 Group3 Group4 Group5 New.Group
1  1      A      E      A   <NA>      A         A
2  2   <NA>      C      A      C      D         C
3  3      C      C   <NA>   <NA>   <NA>         C
4  4   <NA>   <NA>   <NA>      D   <NA>         D
5  5      E      E      C      C   <NA>         E
6  6      C      E   <NA>   <NA>   <NA>         C

first column is more important than second, second than third and so on. — Laura, Jul 21 '20 at 18:00
Related, but doesn't seem to handle the "_In case of ties, select the first value_": [Find the most frequent value by row](https://stackoverflow.com/questions/19982938/find-the-most-frequent-value-by-row) — Henrik, Jul 21 '20 at 19:43
@Henrik For one of our internal packages I've made `rowMode`-function in which you can set what the function should do with ties and missing values. [Posted it here](https://stackoverflow.com/a/63100111/2204410). — Jaap, Jul 26 '20 at 12:29

Allan Cameron · Accepted Answer · 2020-07-21T19:52:56.023

6

I think this achieves what you're looking for. For each row, it creates a table of frequencies of each letter and chooses the largest, whilst preserving column order for ties. It then returns the name of the first column in this table.

Thanks to Henrik for suggesting the improvement.

df$New.Group <- apply(df[-1], 1, function(x) {
names(which.max(table(factor(x, unique(x)))))
})

df
#>   ID Group1 Group2 Group3 Group4 Group5 New.Group
#> 1  1      A      E      A   <NA>      A         A
#> 2  2   <NA>      C      A      C      D         C
#> 3  3      C      C   <NA>   <NA>   <NA>         C
#> 4  4   <NA>   <NA>   <NA>      D   <NA>         D
#> 5  5      E      E      C      C   <NA>         E
#> 6  6      C      E   <NA>   <NA>   <NA>         C

edited Jul 21 '20 at 19:52

answered Jul 21 '20 at 18:02

Allan Cameron

147,086
7
49
87

1

@Henrik yes, that's better. I thought I had tried that and rejected it for some reason, but I think it's logically the same (and neater). Thank you. – Allan Cameron Jul 21 '20 at 18:53
and it does account for ties – Laura Jul 23 '20 at 13:08

tmfmnk · Answer 2 · 2020-07-21T18:37:54.887

One option using dplyr and vctrs (utilizing the location parameter in vec_count() which "orders by location where key first seen") could be:

df %>%
 rowwise() %>%
 mutate(New.Group = na.omit(vec_count(c_across(starts_with("Group")), "location")) %>%
         slice_max(count, with_ties = FALSE) %>%
         pull(key))

     ID Group1 Group2 Group3 Group4 Group5 New.Group
  <int> <fct>  <fct>  <fct>  <fct>  <fct>  <fct>    
1     1 A      E      A      <NA>   A      A        
2     2 <NA>   C      A      C      D      C        
3     3 C      C      <NA>   <NA>   <NA>   C        
4     4 <NA>   <NA>   <NA>   D      <NA>   D        
5     5 E      E      C      C      <NA>   E        
6     6 C      E      <NA>   <NA>   <NA>   C

Or:

df %>%
 rowwise() %>%
 mutate(New.Group = names(which.max(with(na.omit(vec_count(c_across(starts_with("Group")), "location")), setNames(count, key)))))

score 2 · Answer 3 · answered Jul 21 '20 at 19:06

2

We can use the Mode function

Mode <- function(x) {
   ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
 }

df$New.Group <- apply(df[-1], 1, FUN = function(x) Mode(na.omit(x)))
df$New.Group
#[1] "A" "C" "C" "D" "E" "C"

answered Jul 21 '20 at 19:06

akrun

874,273
37
540
662

Get the most frequent value per row and account for ties

3 Answers3