R: how to find the 10 most common pairs by row across 5 variables in a df

Question

I have a dataframe that looks like this:

'data.frame':   430 obs. of  8 variables:
 $ date: chr  "12 June 2020" "05 June 2020" "29 May 2020" "22 May 2020" ...
 $ X1  : int  7 9 22 15 34 11 21 35 33 43 ...
 $ X2  : int  22 16 31 40 49 15 11 13 41 50 ...
 $ X3  : int  30 17 36 32 29 36 41 34 1 2 ...
 $ X4  : int  48 29 8 45 21 9 6 6 18 8 ...
 $ X5  : int  16 39 32 12 27 43 12 15 23 7 ...
 $ num1: int  8 1 6 7 8 9 2 5 6 3 ...
 $ num2: int  2 8 10 10 10 8 1 1 2 2 ...

I need to find the 10 most common pairs within num1 and num2; and the 10 most common pairs within X1, X2, X3, X4, X5.

For the first problem I have got

names(tail(sort(table(unlist(tapply(mydf$num1, mydf$num2, FUN=function(x) combn(unique(x), 2, paste, collapse="-"))))),10))

which produces the output

[1] "7-2" "7-4" "7-5" "8-3" "1-3" "3-1" "4-1" "3-2" "4-2" "1-2"

which works. But I can't transform the code to do the same for X1, X2, X3, X4, X5. I have tried the most obvious (to substitute mydf$num1 etc. to mydf$X1 etc), but I get all sorts of errors, I'm guessing because this time I'm trying to find pairs across 5 variables. Any suggestions as to how to do this would be most welcome.

=========

Edit: Not sure if this helps but this is what I mean:

In this dataset 13-34 is a pair and 41-23 is another. (There may be more, I just did it by looking at it - this is exactly what I am trying to avoid.) So I am trying to find which two numbers occur together and of these which are the most common pairs.

Can you please make a small reproducible example and show expected output based on that so that it is easier to understand what you are trying to do? — Ronak Shah, Jun 14 '20 at 14:01

Chris Ruehlemann · Accepted Answer · 2020-06-15T09:35:30.710

(Using @Martin Gal's data)

EDIT:

This solution pastes the rows of rolling pairs of columns together, using Mapand apply, then tables them and sorts them:

pairs <-
  Map(function(x, y) apply(df[, x:y], 1, paste0, collapse="-"), 1:(ncol(df) - 1), 2:ncol(df))

as.data.frame(sort(table(unlist(pairs)), decreasing = T)[1:10])
   Var1 Freq
1   2-2  648
2   3-1  648
3   2-3  644
4   3-2  641
5   1-4  639
6   3-3  638
7   4-2  633
8   1-3  629
9   4-4  621
10  3-4  618

Alternatively, to obtain the paired numbers you could use a for loop:

pairs <- c()
for(i in 1:ncol(df)){
  if (i < ncol(df)) {                         
    pairs[[i]] <- apply(df[,c(i,i+1)], 1, paste0, collapse = "-")
  } 
}

or, more elegantly, lapply:

pairs <- lapply(seq(df)[-1], function(x) paste(df[, x-1], df[, x], sep = "-"))

Thank you very much, @Chris Ruehlemann, this has worked exactly as needed. Much appreciate the help! — Reader 123, Jun 15 '20 at 18:51

score 2 · Answer 2 · answered Jun 14 '20 at 23:46

If I have understood correctly, you consider a pair only if they occur in the same row. Hopefully this will give you what you want :

cols <- paste0('X', 1:5)
sort(table(apply(df[cols], 1, function(x) combn(x, 2, paste, collapse="-"))), 
           decreasing = TRUE)[1:10]

#4-4 2-4 2-3 4-3 4-1 2-2 3-2 3-3 1-3 2-1 
# 27  24  21  21  18  17  16  15  14  14

data

set.seed(1234)
df <- data.frame(X1=sample(1:4, 25, replace=TRUE),
                 X2=sample(1:4, 25, replace=TRUE),
                 X3=sample(1:4, 25, replace=TRUE),
                 X4=sample(1:4, 25, replace=TRUE),
                 X5=sample(1:4, 25, replace=TRUE))

score 1 · Answer 3 · edited Jun 14 '20 at 13:51

1

Use the count function in dplyr with inputs of all the variables where you want to find the unique combinations. It has a built-in way to sort. Not only do you get the unique combinations, but you also get the counts.

library(dplyr)
mydf %>% count(X1, X2, X3, X4, X5, sort = TRUE)

edited Jun 14 '20 at 13:51

Martin Gal

16,640
5
21
39

answered Jun 14 '20 at 13:33

GeoffDS

1,221
5
19
31

I have just tried this but it has produced: X1 X2 X3 X4 X5 n 1 1 2 5 24 47 1 2 1 2 11 19 47 1 3 1 2 16 31 50 1 4 1 3 11 18 31 1 How does this tell me what the most common pairs are? What am I missing? – Reader 123 Jun 14 '20 at 13:38
Tried with the edit from @Martin Gal but it still does not produce the required output. For one thing, the column headed "n" always has 1 in it, which is obviously not true, given that the pair 1-2 appears 3 times; and the pair 1-4 appears 4 times. I am only looking for pairs and this code does not give that. – Reader 123 Jun 14 '20 at 14:00

score 1 · Answer 4 · answered Jun 14 '20 at 14:50

1

Not really sure what do you want. Let's take this data.frame:

set.seed(1234)
df <- data.frame(X1=sample(1:4, 2500, replace=TRUE),
                 X2=sample(1:4, 2500, replace=TRUE),
                 X3=sample(1:4, 2500, replace=TRUE),
                 X4=sample(1:4, 2500, replace=TRUE),
                 X5=sample(1:4, 2500, replace=TRUE))

Using dplyr:

df %>%
  mutate(class=str_c(X1, "-", X2, "-", X3, "-", X4, "-", X5)) %>%
  count(class, name="count") %>%
  slice_max(count, n=10)

gives you

       class count
1  2-1-4-3-1     8
2  4-1-3-1-1     8
3  1-4-1-2-1     7
4  2-2-4-1-1     7
5  2-4-1-4-4     7
6  3-1-2-2-4     7
7  3-4-2-4-4     7
8  4-1-4-4-1     7
9  4-3-1-2-2     7
10 4-4-1-3-1     7

which are the top 10 combinations of X1 - X5.

Note: If there are more than 10 combinations with the same count they are all returned.

answered Jun 14 '20 at 14:50

Martin Gal

16,640
5
21
39

Thank you very much for taking the time to help, @Martin. The above is great - but I only need pairs by row. Two numbers. So e.g. in the output of the 10 top combination above, 3&1 occur 5 times together (in rows 1,2,6,9,10); 2&1 occur 6 times together, 2&4 occur 7 times together, etc. (My data does not repeat the same number in one row, so out of 1-2 and 2-1 only one can occur.) I hope this makes sense.... – Reader 123 Jun 14 '20 at 15:12
@Reader123 Please edit your question and give an example of your desired output based on the data you provided. I don't think I understand what you are looking for. – Martin Gal Jun 14 '20 at 15:14
I have just done it - hope it helps. Thanks again for taking the time! – Reader 123 Jun 14 '20 at 15:37

R: how to find the 10 most common pairs by row across 5 variables in a df

4 Answers4

Linked