Create a matrix with lowest cell count for every pair of binary variables

Question

I have a dataset with several binary variables (x1-x5, values: 1, 2, NA). My goal is to identify whether pairs of binary variables have zero or very low cell counts in the cross-tab table (after ignoring the missing values). So, I would like to calculate the cross-tab table for each pair of binary variables in my data set, extract the lowest value from each table, and report the lowest value from each cross-table into a matrix. By doing so, I would have something similar to a correlation matrix where, instead of correlation coefficients, I would be able to look at the lowest cell count for each pair of variables. Below I created a toy dataset for anyone who will decide to help.

library(tidyverse)
    
df <- data.frame(x1 = rbinom(n = 1000, size = 1, prob = 0.5),
                     x2 = rbinom(n = 1000, size = 1, prob = 0.3),
                     x3 = rbinom(n = 1000, size = 1, prob = 0.4),
                     x4 = rbinom(n = 1000, size = 1, prob = 0.2),
                     x5 = rbinom(n = 1000, size = 1, prob = 0.05)) |> 
      mutate(across(everything(), ~as.factor(.))) |> 
      mutate(across(everything(), ~recode(., "1" = "2", "0" = "1")))
    
df1 <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))

Ben Bolker · Accepted Answer · 2022-07-18T20:50:53.477

2

I think this is what you mean. It's inefficient (we should only compute one triangle) but short.

cfun <- function(i, j) {
   min(table(df[[i]], df[[j]]))
}
outer(1:ncol(df), 1:ncol(df), Vectorize(cfun))

If you want to be more efficient:

n <- ncol(df)
m <- matrix(NA_integer_, n, n, dimnames = list(names(df), names(df)))
for (i in 1:(n-1)) {
   for (j in (i+1):n) {
      m[j,i] <- cfun(i,j)
   }
}

edited Jul 18 '22 at 20:50

answered Jul 18 '22 at 20:34

Ben Bolker

211,554
25
370
453

1

@dcsuka, if your (now deleted) answer works, you might as well include it. Diversity is good. – Ben Bolker Jul 18 '22 at 20:47
This is great! Exactly what I was looking for. Couple of minor changes, if you can: 1) How do I include the name of the variables in the matrix (both for columns and rows)? Also, can we use the lower triangle of the matrix instead of the upper triangle? (Just a personal preference). – Michael Matta Jul 18 '22 at 20:49
1

see edits ...... – Ben Bolker Jul 18 '22 at 20:51

score 2 · Answer 2 · answered Jul 18 '22 at 23:08

Someone (probably @dcsuka) suggested another solution but then deleted it from the answer section. Thankfully, I had already saved it in my script. After tweaking the code a tiny bit, it returned the correct results. So I am copying it here because, as Ben said, diversity is good.

df2 <- df1 %>%
  colnames() %>%
  combn(2) %>%
  t() %>%
  as_tibble(.name_repair = ~c("var1", "var2"))

df3 <- df2 %>%
  rowwise() %>%
  mutate(crosstab = list(as_tibble(table(select(df1, var1, var2)))),
         value = min(list(select(crosstab, n))[[1]])) %>%
  select(-crosstab) %>% 
  pivot_wider(names_from = var1, values_from = value)

Create a matrix with lowest cell count for every pair of binary variables

2 Answers2