0

I’m trying to write a function that will read in a large base table (example below) and check if any of the unique entities (ID) can be linked via 15+ attributes (bank a/c, phone num, email, zip code…etc). No fuzzy matching required this time.

df <- data.frame( id = c('01','02','03','04','05','06','07','08','09','10'),
bank_acc=c('66201','66202','66203','66204','66205','66205','66205','66206','66207','66208'),
phone_num=c('10151','10150','10152','10150','10153','10150','10154','10155','10156','10157'))

I need the output in an edgelist format (example below) so I can input into igraph, I plan to use the “Method” column to colour code the edges. Thanks in advance

ID  Linked_ID Method 
05  06   bank_acc
05  07   bank_acc
06  07   bank_acc
02  04   phone_num
02  06   phone_num
Sean Mc
  • 458
  • 6
  • 14

1 Answers1

1

You could split on the field of interest, compute all pairs (combn can be useful here), and then combine:

get.pairs <- function(colname) {
  spl <- split(df, df[,colname])
  do.call(rbind, lapply(spl, function(x) {
    if (nrow(x) == 1) {
      return(NULL)  # No duplicates for this value
    } else {
      combs <- combn(nrow(x), 2)
      return(data.frame(ID=x$id[combs[1,]], Linked_ID=x$id[combs[2,]], Method=colname, stringsAsFactors=F))
    }
  }))
}
do.call(rbind, lapply(c("bank_acc", "phone_num"), get.pairs))
#         ID Linked_ID    Method
# 66205.1 05        06  bank_acc
# 66205.2 05        07  bank_acc
# 66205.3 06        07  bank_acc
# 10150.1 02        04 phone_num
# 10150.2 02        06 phone_num
# 10150.3 04        06 phone_num
josliber
  • 43,891
  • 12
  • 98
  • 133