R Function to identify non-matching rows

Question

I am trying to compare 2 data.frames, "V1" represents my CRM, "V2" represents Leads that I would like to send out.

'V1 has roughly 8k elements' 'V2 has roughly 25k elements'

I need to compare every row in V2 to every row in V1, discard every instance where a V2 element exists in V1.

I would then like to return only the elements that do not appear either exactly or loosely in V1 into the Leads column.

The goal is to send out a lead(V2) that does not exist in CRM(V1).

I've made some good progress with the stringdist package and divided 'soundex' by 'osa' to better my chances although this method still returns elements in V1.:(

This is the expected result I'm looking for in the Leads column, based on this example:

Leads: J.Jones Restoration A.W. Builders C&C Contractors

Any help would be greatly appreciated and I apologize if this is unclear in any way.

library(reprex)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
library(tidystringdist)

df <- tibble::tribble(
  ~V1,  ~V2,
  "5th Generation Builder", "5th Generation Builder, LLC",
  "5th Generation Builders Inc.",   "5th Generation Builders",
  "89 Contractors LLC", "89 Contractors LLC",
  "906 Studio Architects LLC",  "906 Studio Architects",
  "A & A Glass Co.",    "Paragon Const.",
  "A & E Farm", "A & E Farm",
  "A & H GLASS",    "C & C Contractors",
  "A & J Homeworks,Painting, and Restoration",  "A.W. Builders",
  "Paragon Const.", "J. Jones Restoration",
  "A & L Construction", "A & L Const.")

tidy_e <- tidy_stringdist(df) %>% 
  filter(soundex>=1) %>% 
  select(-V1, V2) %>% 
  arrange(V2,osa) %>% 
  mutate(V2, sim = soundex/ osa) %>% 
  distinct(V2, osa, soundex, sim) %>% 
  rename('Leads'= 'V2')

^{Created on 2020-04-13 by the reprex package (v0.3.0)}

The current structure of `df` has the same number of items in V1 and V2, but your description is that you have 8K in V1 and 25K in V2. How did you create the `df` variable from your two datasets? — David Robinson, Apr 13 '20 at 22:33
This is just a sample that represents the general sticking point. Would a link to the Dataset be helpful? — sbaumbaugh, Apr 14 '20 at 00:43
My problem seems rather simple and I used to have someone manually take care of this, but that luxury has disappeared. I do hope someone can shed some light on a more elegant, efficient way to handle this. Would the full dataset be more helpful? @DavidRobinson — sbaumbaugh, Apr 14 '20 at 12:46
Quite an overwhelmed community, maybe someone can at least tell me where I might look for the answer? — sbaumbaugh, Apr 15 '20 at 02:41
I share a potential solution below! Thanks for asking a reproducible and well-formed question — David Robinson, Apr 15 '20 at 03:12

score 0 · Accepted Answer · answered Apr 15 '20 at 03:11

You can use the fuzzyjoin package, designed for joining tables based on inexact matching such as string distance. (Disclaimer is that I'm the maintainer).

If you had your data in two separate tables V1 and V2:

V1 <- tibble(name = c("5th Generation Builder", "5th Generation Builders Inc.", "89 Contractors LLC", 
                      "906 Studio Architects LLC", "A & A Glass Co.", "A & E Farm", 
                      "A & H GLASS", "A & J Homeworks,Painting, and Restoration", "Paragon Const.", 
                      "A & L Construction"))

V2 <- tibble(name = c("5th Generation Builder, LLC", "5th Generation Builders", "89 Contractors LLC", 
                      "906 Studio Architects", "Paragon Const.", "A & E Farm", "C & C Contractors", 
                      "A.W. Builders", "J. Jones Restoration", "A & L Const."))

Then you could use stringdist_anti_join() to find the ones in V2 that don't have a soundex match in V1:

V2 %>%
  stringdist_anti_join(V1, by = "name", method = "soundex")

Results:

# A tibble: 3 x 1
  name                
  <chr>               
1 C & C Contractors   
2 A.W. Builders       
3 J. Jones Restoration

See this vignette for more on the stringdist_ joins.

Note that if you'd wanted to see which each matched to, you could use stringdist_left_join():

V2 %>%
  stringdist_left_join(V1, by = "name", method = "soundex")

# A tibble: 12 x 2
   name.x                      name.y                      
   <chr>                       <chr>                       
 1 5th Generation Builder, LLC 5th Generation Builder      
 2 5th Generation Builder, LLC 5th Generation Builders Inc.
 3 5th Generation Builders     5th Generation Builder      
 4 5th Generation Builders     5th Generation Builders Inc.
 5 89 Contractors LLC          89 Contractors LLC          
 6 906 Studio Architects       906 Studio Architects LLC   
 7 Paragon Const.              Paragon Const.              
 8 A & E Farm                  A & E Farm                  
 9 C & C Contractors           NA                          
10 A.W. Builders               NA                          
11 J. Jones Restoration        NA                          
12 A & L Const.                A & L Construction

Thank you so much David, It works perfect on the sample and I've set up the working df's the same. The first pass V1/V2 of 8913 obs, yielded just 42 unique results. with a couple NA's in the output. A filter should remedy. I've combed through your tidy tuesday YouTube channel and have picked up some good info. I can't thank you enough for the help Sir.@DavidRobinson — sbaumbaugh, Apr 15 '20 at 12:40

R Function to identify non-matching rows

1 Answers1

Linked

Related