1

Goal: I have a table of records (for example people) with a dynamic number of variables (e.g., email, phone, birthday,..., others). I want to compare each row with every other row and sum the number of variables that match.

# Input
my_data <- tibble(person = c("A","B","C","E","F"),
                 email = c("A@me.com", "A@me.com", NA, NA, NA),
                 phone = c(NA, NA, NA, 801, 801),
               birthday = c("Jan1", "Jan1", NA, NA, NA))

# A tibble: 5 x 4
  person email    phone birthday
  <chr>  <chr>    <dbl> <chr>   
1 A      A@me.com    NA Jan1    
2 B      A@me.com    NA Jan1    
3 C      NA          NA NA      
4 E      NA         801 NA      
5 F      NA         801 NA  

Using tidyr::expand_grid, I can get every combination of records.

# Munging
my_data_a <- my_data %>%
  rename_with(~str_c(., "_a"), everything())
my_data_b <- my_data %>%
  rename_with(~str_c(., "_b"), everything())
grid <- expand_grid(my_data_a, my_data_b)

# A tibble: 25 x 9
   person_a email_a  phone_a birthday_a person_b email_b  phone_b birthday_b total
   <chr>    <chr>      <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <int>
 1 A        A@me.com      NA Jan1       A        A@me.com      NA Jan1           2
 2 A        A@me.com      NA Jan1       B        A@me.com      NA Jan1           2
 3 A        A@me.com      NA Jan1       C        NA            NA NA             0
 4 A        A@me.com      NA Jan1       E        NA           801 NA             0
 5 A        A@me.com      NA Jan1       F        NA           801 NA             0
 6 B        A@me.com      NA Jan1       A        A@me.com      NA Jan1           2
 7 B        A@me.com      NA Jan1       B        A@me.com      NA Jan1           2
 8 B        A@me.com      NA Jan1       C        NA            NA NA             0
 9 B        A@me.com      NA Jan1       E        NA           801 NA             0
10 B        A@me.com      NA Jan1       F        NA           801 NA             0
# … with 15 more rows

Now I can compare each of the variables manually, but the issue is I will have more than email, phone, birthday.

grid %>%
  mutate(email_match = email_a == email_b,
         phone_match = phone_a == phone_b,
         birthday_match = birthday_a == birthday_b) %>%
  mutate(across(everything(), ~replace_na(., 0)),
         total = email_match + phone_match + birthday_match) %>%
  select(person_a, person_b, total)

# Output
   person_a person_b total
   <chr>    <chr>    <dbl>
 1 A        A            2
 2 A        B            2
 3 A        C            0
 4 A        E            0
 5 A        F            0
 6 B        A            2
 7 B        B            2
 8 B        C            0
 9 B        E            0
10 B        F            0
# … with 15 more rows

This can be done by brute force in a for loop, but the data set is large:

# Brute force
a_col_start <- 2
a_col_end <- ncol(grid)/2
b_col_start <- a_col_end + 2
b_col_end <- ncol(grid)
for (i in 1:nrow(grid)) {
  grid[i,"total"] <- sum(grid[i,a_col_start:a_col_end] == grid[i,b_col_start:b_col_end], na.rm = TRUE)
}
grid %>%
  select(person_a, person_b, total)
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
Jeff Parker
  • 1,809
  • 1
  • 18
  • 28

2 Answers2

1

You can use pmap function from package purrr for your purpose. It will make it easy to compare two vectors (in the same row) element-wise:

library(dplyr)
library(purrr)
library(stringr)


grid %>%
  mutate(total = pmap_dbl(grid, ~ sum(c(...)[str_detect(names(grid), "_a")][-1] == 
                        c(...)[str_detect(names(grid), "_b")][-1], na.rm = TRUE))) %>%
  select(contains("person"), total)


# A tibble: 25 x 3
   person_a person_b total
   <chr>    <chr>    <dbl>
 1 A        A            2
 2 A        B            2
 3 A        C            0
 4 A        E            0
 5 A        F            0
 6 B        A            2
 7 B        B            2
 8 B        C            0
 9 B        E            0
10 B        F            0
# ... with 15 more rows
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    `purrr` to the rescue. I suspected a function in that package would be answer. – Jeff Parker May 21 '21 at 17:27
  • Yes exactly. `purrr`'s package functions are the first choices that come to my mind in col-wise and row-wise operations. In particular in your case we are dealing with 2 sets of variables in each row, I could not think of any other way. I already asked some questions regarding the applications of `pmap` thought you might be interested to take a look for future cases you might come across: https://stackoverflow.com/questions/67037099/using-pmap-function-to-check-whether-all-values-in-a-row-are-positive-or-negativ https://stackoverflow.com/questions/67049561/using-pmap-with-c-part-2 – Anoushiravan R May 21 '21 at 17:46
  • 1
    Side note: this started to bog down and get slow when the input was >1M rows on my laptop. But's it's quite quick <1M rows. I'm sure there's a CS reason for that for another day. I'll trim my input down in the meantime. – Jeff Parker May 21 '21 at 21:02
  • Maybe a `data.table` solution will do for data sets of more than 1M row. I unfortunately have not much experience in that area. However glad to hear it will do the job for almost 1M obs. – Anoushiravan R May 21 '21 at 21:11
0

If you only want unique combinations you can use combn() to get all pairwise combinations and use this as input to Map() to get the sum of matches for each pair of rows.

people <- combn(my_data$person, 2)

match_finder <- function(x, y) {
  personx <- my_data[my_data$person == x, ]
  persony <- my_data[my_data$person == y, ]
  match_sum <- sum(personx == persony, na.rm = TRUE)
  list(person1 = as.character(x), person2 = as.character(y), match_sum = match_sum)
  }

output <- Map(match_finder, people[1, ], people[2, ], USE.NAMES = FALSE)

as.data.frame(do.call(rbind, output))

   person1 person2 match_sum
1        A       B         2
2        A       C         0
3        A       E         0
4        A       F         0
5        B       C         0
6        B       E         0
7        B       F         0
8        C       E         0
9        C       F         0
10       E       F         1
fisher-j
  • 68
  • 7