1

I feel like this is such a simple thing, but I'm having so much trouble wrapping my head around it for some reason.

Say I have a long-format data frame of two categorical variables like so:

df <- data.frame(
  table_name = c("tbl1", "tbl1", "tbl1", "tbl2", "tbl2", "tbl2", "tbl3", "tbl3",
                 "tbl3"),
  column_name = c("first name", "address", "city", "first name", "last name",
                  "ice cream", "last name", "potato", "ice cream")
)

which looks like this:

df

table_name  column_name
tbl1        first name          
tbl1        address         
tbl1        city            
tbl2        first name          
tbl2        last name           
tbl2        ice cream           
tbl3        last name           
tbl3        potato          
tbl3        ice cream

I'd like to pivot_wider() to essentially create a matrix of matched column_name between all combinations of table_name. The expected output would be something like this:

     tbl1 tbl2 tbl3
tbl1    3    1    0
tbl2    1    3    2
tbl3    0    2    3

If it helps at all, the use case here is that I want to build a network graph using igraph. I've got a database that I want to visualize all the column_name connections among table_name.

2 Answers2

3

We could self-join, count matches by table_name, and pivot_wider:

library(tidyverse)
inner_join(df, df, join_by(column_name)) |>
  count(table_name.x, table_name.y) |>
  pivot_wider(names_from = table_name.y, values_from = n, values_fill = 0)

# A tibble: 3 × 4
  table_name.x  tbl1  tbl2  tbl3
  <chr>        <int> <int> <int>
1 tbl1             3     1     0
2 tbl2             1     3     2
3 tbl3             0     2     3
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • And you've done it! I would have never figured this out. Thank you so much for taking the time to do this. I couldn't get the idea of a group_by() and summarize() matches out of my head, and I knew I would going down the wrong path. Thanks again! – Thanatocoenose Aug 16 '23 at 18:02
2

We can pivot wider, convert to a matrix, and use matrix multiplication:

library(tidyr)
library(dplyr)
library(tibble)

mat = df |>
  mutate(value = 1) |>
  pivot_wider(names_from = column_name, values_from = value, values_fill = 0) |>
  column_to_rownames("table_name") |>
  as.matrix()

mat %*% t(mat)
#      tbl1 tbl2 tbl3
# tbl1    3    1    0
# tbl2    1    3    2
# tbl3    0    2    3
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294