2

I have a large dataset (about 130 columns) that looks like this:

data <- data.frame(AA = c("Apple", "Banana", "0", "Cherry", "0"),
                   AB = c("0", "0", "Apple", "Cherry", "0"),
                   AC = c("0", "0", "0", "0", "Cherry"),
                   AD = c("Cherry", "0", "0", "Banana", "0"))

Can anyone help me find the number of exact matches between all possible pairs of columns? The output should look something like this:

> output
  AA_AB AA_AC AA_AD AB_AC AB_AD AC_AD
      3     2     3     2     2     2

I have tried using length(intersect(data$AA, data$AB)) but I don't know how to apply this to the whole dataset within one command. From searching other queries on Stackoverflow, I also tried first creating a dataframe with all possible column pairs, but I couldn't figure out how to actually make my output match my original column names to then apply the intersect function.

N <- ncol(data)
combos <- expand.grid(1:N,1:N) %>%
  filter(!Var1==Var2)
combos <- combos[!duplicated(t(apply(combos[1:2], 1, sort))),]
Djoustaine
  • 65
  • 7

1 Answers1

2

You can use combn to create combination of column names and count common values between them using length and intersect.

val <- combn(names(data), 2, function(x) 
             length(intersect(data[[x[1]]], data[[x[2]]])))

names(val) <- combn(names(data), 2, paste0, collapse =  '_')
val
#AA_AB AA_AC AA_AD AB_AC AB_AD AC_AD 
#    3     2     3     2     2     2 
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213