I am trying to compare two data frames: "V1" represents my CRM and "V2" represents leads that I would like to send out.
V1 has roughly 8k elements and V2 has roughly 25k elements.
I need to compare every row in V2 to every row in V1 and discard every V2 element that already exists in V1.
I would then like to return only the elements that do not appear in V1, either exactly or loosely, in a Leads column.
The goal is to send out only leads (V2) that do not already exist in the CRM (V1).
I've made some progress with the stringdist package by dividing the 'soundex' distance by the 'osa' distance to improve my chances, but this method still returns elements that exist in V1. :(
Based on this example, this is the result I expect in the Leads column:
Leads
J. Jones Restoration
A.W. Builders
C & C Contractors
Any help would be greatly appreciated and I apologize if this is unclear in any way.
library(reprex)
library(tidyverse)
#> Warning: package 'tibble' was built under R version 3.6.2
library(tidystringdist)
df <- tibble::tribble(
~V1, ~V2,
"5th Generation Builder", "5th Generation Builder, LLC",
"5th Generation Builders Inc.", "5th Generation Builders",
"89 Contractors LLC", "89 Contractors LLC",
"906 Studio Architects LLC", "906 Studio Architects",
"A & A Glass Co.", "Paragon Const.",
"A & E Farm", "A & E Farm",
"A & H GLASS", "C & C Contractors",
"A & J Homeworks,Painting, and Restoration", "A.W. Builders",
"Paragon Const.", "J. Jones Restoration",
"A & L Construction", "A & L Const.")
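# tidy_stringdist() computes the distances row by row (V1[i] vs V2[i])
tidy_e <- tidy_stringdist(df) %>%
  filter(soundex >= 1) %>%          # keep rows whose soundex codes differ
  select(-V1) %>%                   # drop the CRM column
  arrange(V2, osa) %>%
  mutate(sim = soundex / osa) %>%   # ratio of the soundex and osa distances
  distinct(V2, osa, soundex, sim) %>%
  rename(Leads = V2)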
Created on 2020-04-13 by the reprex package (v0.3.0)
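
To make the "every row in V2 against every row in V1" comparison concrete, here is a minimal sketch of what I have in mind. It is not what my code above does. It swaps in Jaro-Winkler distance from the stringdist package instead of my soundex/osa ratio, and several pieces are assumptions of mine: the clean() helper, the list of company suffixes it strips, and the 0.15 cutoff are all hypothetical and would need tuning on real data.

```r
library(stringdist)

crm <- c("5th Generation Builder", "5th Generation Builders Inc.",
         "89 Contractors LLC", "906 Studio Architects LLC",
         "A & A Glass Co.", "A & E Farm", "A & H GLASS",
         "A & J Homeworks,Painting, and Restoration",
         "Paragon Const.", "A & L Construction")

leads <- c("5th Generation Builder, LLC", "5th Generation Builders",
           "89 Contractors LLC", "906 Studio Architects",
           "Paragon Const.", "A & E Farm", "C & C Contractors",
           "A.W. Builders", "J. Jones Restoration", "A & L Const.")

# Hypothetical helper: lower-case, replace punctuation with spaces,
# drop a few company suffixes (assumed list), and squeeze whitespace.
clean <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\b(llc|inc|co)\\b", " ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Distance matrix: one row per lead, one column per CRM entry.
# method = "jw" with p = 0.1 gives Jaro-Winkler distance in [0, 1].
d <- stringdistmatrix(clean(leads), clean(crm), method = "jw", p = 0.1)

# Distance from each lead to its closest CRM entry.
min_dist <- apply(d, 1, min)

# Assumed cutoff: a lead counts as "already in the CRM" at or below it.
threshold <- 0.15
new_leads <- leads[min_dist > threshold]
new_leads
```

With the full data (25k x 8k) the matrix would hold about 200 million distances, roughly 1.6 GB as doubles, so it may be necessary to process the leads in chunks, or to use stringdist::amatch() with maxDist, which returns only the index of the closest match (NA when nothing is within maxDist).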