5

Sample data:

df <- data.frame("ID" = 1:6, 
                 "Group1" = c("A", NA, "C", NA, "E", "C"), 
                 "Group2" = c("E", "C", "C", NA, "E", "E"),
                 "Group3" = c("A", "A", NA, NA, "C", NA),
                 "Group4" = c(NA, "C", NA, "D", "C", NA),
                 "Group5" = c("A", "D", NA, NA, NA, NA))

In each row, I want to count the number of each value and store the most frequent value in a new variable, New.Group. In case of ties, the first value in the row should be selected. The logic applied to the example:

Row 1 of New.Group takes value A because it is most frequent value in the row, ignoring NAs.

Row 2 takes value C because it is also the most frequent value.

Row 3 the same as Row 2.

Row 4 takes value D because it's the only value in the row.

In Row 5 both E and C has count 2, but E is selected because it is encountered before C in the row.

Row 6, similar to row 5, both C and E has count 1, but C is selected because it is encountered before E in the row.

The desired output:

  ID Group1 Group2 Group3 Group4 Group5 New.Group
1  1      A      E      A   <NA>      A         A
2  2   <NA>      C      A      C      D         C
3  3      C      C   <NA>   <NA>   <NA>         C
4  4   <NA>   <NA>   <NA>      D   <NA>         D
5  5      E      E      C      C   <NA>         E
6  6      C      E   <NA>   <NA>   <NA>         C
Henrik
  • 65,555
  • 14
  • 143
  • 159
Laura
  • 306
  • 3
  • 12
  • 1
    How exactly are the column priorities defined? – tmfmnk Jul 21 '20 at 17:56
  • first column is more important than second, second than third and so on. – Laura Jul 21 '20 at 18:00
  • 1
    Related, but doesn't seem to handle the "_In case of ties, select the first value_": [Find the most frequent value by row](https://stackoverflow.com/questions/19982938/find-the-most-frequent-value-by-row) – Henrik Jul 21 '20 at 19:43
  • @Henrik For one of our internal packages I've made `rowMode`-function in which you can set what the function should do with ties and missing values. [Posted it here](https://stackoverflow.com/a/63100111/2204410). – Jaap Jul 26 '20 at 12:29
  • 1
    Thanks for the heads-up @Jaap. Nice! – Henrik Jul 26 '20 at 12:32

3 Answers3

6

I think this achieves what you're looking for. For each row, it creates a table of frequencies of each letter and chooses the largest, whilst preserving column order for ties. It then returns the name of the first column in this table.

Thanks to Henrik for suggesting the improvement.

df$New.Group <- apply(df[-1], 1, function(x) {
names(which.max(table(factor(x, unique(x)))))
})

df
#>   ID Group1 Group2 Group3 Group4 Group5 New.Group
#> 1  1      A      E      A   <NA>      A         A
#> 2  2   <NA>      C      A      C      D         C
#> 3  3      C      C   <NA>   <NA>   <NA>         C
#> 4  4   <NA>   <NA>   <NA>      D   <NA>         D
#> 5  5      E      E      C      C   <NA>         E
#> 6  6      C      E   <NA>   <NA>   <NA>         C
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
3

One option using dplyr and vctrs (utilizing the location parameter in vec_count() which "orders by location where key first seen") could be:

df %>%
 rowwise() %>%
 mutate(New.Group = na.omit(vec_count(c_across(starts_with("Group")), "location")) %>%
         slice_max(count, with_ties = FALSE) %>%
         pull(key))

     ID Group1 Group2 Group3 Group4 Group5 New.Group
  <int> <fct>  <fct>  <fct>  <fct>  <fct>  <fct>    
1     1 A      E      A      <NA>   A      A        
2     2 <NA>   C      A      C      D      C        
3     3 C      C      <NA>   <NA>   <NA>   C        
4     4 <NA>   <NA>   <NA>   D      <NA>   D        
5     5 E      E      C      C      <NA>   E        
6     6 C      E      <NA>   <NA>   <NA>   C 

Or:

df %>%
 rowwise() %>%
 mutate(New.Group = names(which.max(with(na.omit(vec_count(c_across(starts_with("Group")), "location")), setNames(count, key)))))
tmfmnk
  • 38,881
  • 4
  • 47
  • 67
2

We can use the Mode function

Mode <- function(x) {
   ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
 }

df$New.Group <- apply(df[-1], 1, FUN = function(x) Mode(na.omit(x)))
df$New.Group
#[1] "A" "C" "C" "D" "E" "C"
akrun
  • 874,273
  • 37
  • 540
  • 662