Finding the most common pairs by row of 5 variables, summarized by group, in a df

Question

I am trying to find the most common pairs across 5 variables, very similar to this question. The main difference is that I have one more variable that I'd like to group the results by.

'data.frame':   430 obs. of  6 variables:
 $ group: chr  "Celtics" "Pelicans" "Suns" ...
 $ X1  : int  7 9 22 15 34 11 21 35 33 43 ...
 $ X2  : int  22 16 31 40 49 15 11 13 41 50 ...
 $ X3  : int  30 17 36 32 29 36 41 34 1 2 ...
 $ X4  : int  48 29 8 45 21 9 6 6 18 8 ...
 $ X5  : int  16 39 32 12 27 43 12 15 23 7 ...

The output I'd like would look like this:

   group             Pair                   n
   <chr>             <chr>                  <dbl>
 1 Suns              41-23                  30

I don't have a good enough grasp of using the combn function with group_by and a dplyr mutate to make this work yet. Any help would be appreciated.

score 0 · Answer 1 · answered Sep 20 '21 at 04:27

You can write a custom function (borrowing from the answer to the linked question):

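return_pairs <- function(data, id) {
  # For each row, build every 2-element combination as an "a-b" string
  pairs <- apply(data, 1, function(x) combn(x, 2, paste, collapse = "-"))
  # Count each pair across all rows of this group, most frequent first
  vals <- sort(table(pairs), decreasing = TRUE)

  data.frame(id = id,
             pair = names(vals),
             Freq = as.numeric(vals))
}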
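To see what the combn() call inside the function produces, here is a minimal sketch on a single row. The vector `row1` is just the first row from the str() output in the question, and the name `pairs_one_row` is made up for illustration.

```r
# First row of the question's str() output, used purely as an illustration
row1 <- c(X1 = 7, X2 = 22, X3 = 30, X4 = 48, X5 = 16)

# combn() enumerates every 2-element combination of the row;
# paste(..., collapse = "-") turns each combination into one "a-b" string
pairs_one_row <- combn(row1, 2, paste, collapse = "-")
pairs_one_row
# "7-22" "7-30" "7-48" "7-16" "22-30" "22-48" "22-16" "30-48" "30-16" "48-16"
```

Five values give choose(5, 2) = 10 pairs per row, and each pair keeps the column order of the row.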

Split the data by group and apply the function.

library(purrr)
library(dplyr)

imap_dfr(split(df[-1], df$id), return_pairs) %>%
  group_by(id) %>%
  # top 5 frequencies per id; slice_max() keeps ties by default, so a group can return more than 5 rows
  slice_max(Freq, n = 5)

#   id    pair  Freq   
#   <chr> <chr> <dbl>
# 1 1     4-4   12     
# 2 1     4-3   10     
# 3 1     1-1    8     
# 4 1     1-3    8     
# 5 1     2-3    7     
# 6 1     3-3    7     
# 7 1     4-1    7     
# 8 2     2-4   14     
# 9 2     3-4    9     
#10 2     4-1    9     
#11 2     4-3    9     
#12 2     4-4    9     
#13 3     3-2    7     
#14 3     2-3    6     
#15 3     4-4    6     
#16 3     2-2    5     
#17 3     2-4    5
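Note that the output above counts "4-3" and "3-4" as different pairs, because each pair keeps the column order of its row. If the order within a pair should not matter, one possible tweak is to sort each row before pairing. This is only a sketch: `return_unordered_pairs` and the `toy` data frame are hypothetical names and values, not part of the original answer.

```r
# Hypothetical variant of return_pairs(): sort each row before pairing,
# so "3-4" and "4-3" are tallied as the same pair.
return_unordered_pairs <- function(data, id) {
  pairs <- apply(data, 1, function(x) combn(sort(x), 2, paste, collapse = "-"))
  vals <- sort(table(pairs), decreasing = TRUE)

  data.frame(id = id,
             pair = names(vals),
             Freq = as.numeric(vals))
}

# Toy data (made up): both rows hold the values 1, 3, 4 in different orders
toy <- data.frame(X1 = c(3, 4), X2 = c(4, 3), X3 = c(1, 1))
res <- return_unordered_pairs(toy, "a")
res
```

Each of the three pairs "1-3", "1-4" and "3-4" is counted twice here, once per row. The function should drop into the same pipeline, e.g. imap_dfr(split(df[-1], df$id), return_unordered_pairs).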

data

It is easier to help if you provide data in a reproducible format, like this:

set.seed(1234)
df <- data.frame(id = rep(c(1, 2, 3), c(10, 10, 5)),
                 X1=sample(1:4, 25, replace=TRUE),
                 X2=sample(1:4, 25, replace=TRUE),
                 X3=sample(1:4, 25, replace=TRUE),
                 X4=sample(1:4, 25, replace=TRUE),
                 X5=sample(1:4, 25, replace=TRUE))
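For real data, one common way to share it reproducibly is dput(), which prints R code that rebuilds the object. A minimal sketch, assuming a small made-up data frame `small` that only borrows a couple of values from the question:

```r
# Toy stand-in for the real data (values are illustrative only)
small <- data.frame(group = c("Celtics", "Suns"),
                    X1 = c(7L, 22L))

# dput() prints a structure(...) call; paste that text into the question
dput(small)

# Anyone can then rebuild the object by evaluating the printed text
rebuilt <- eval(parse(text = paste(capture.output(dput(small)), collapse = "")))
```

For a large data frame, dput(head(df, 20)) keeps the pasted text short.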
