
I am trying to find data entry errors in the names and locations of my dataset by fuzzy matching. I have a unique key from the original data, siterow_id, and have made a new key, pi_key, where I already identified some hard matches (no fuzzy matching). After running the fuzzy matching I get duplicate values: some siterow_id's appear on both the left and the right side of the join. I can manually look at the data, see where this occurs, and hard code the rows to remove, but I want a more algorithmic way of doing this as I move to a larger dataset with many more matches.

I tried doing it this way, but it removes the matches on both the left and the right. If possible I would love a tidyverse way to do this rather than a loop.

The table output is included below. You can see a duplicate in rows 8 and 9.

# For each left-side siterow_id, drop every row where it also appears on the right
# (this ends up removing both rows of a reciprocal pair, not just one of them)
for (site in three_letter_matches$siterow_id.x) {
  if (any(three_letter_matches$siterow_id.y == site)) {
    three_letter_matches <- three_letter_matches[three_letter_matches$siterow_id.y != site, ]
  }
}




  pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
  <chr>    <chr>        <chr>       <chr>        <chr>  <chr>     <chr>    <chr>       
1 6309     1-9CH29M     kim         kevin        san f~ united s~ 11870    1-HC3YY6    
2 7198     1-CJGRSZ     kim         jinseok      seoul  korea re~ 2952     1-2QBRZ2    
3 7198     1-CJGRSZ     kim         jinseok      seoul  korea re~ 2952     1-3AHHSU    
4 7198     1-CJGRSZ     kim         jinseok      seoul  korea re~ 2952     1-3JYF8V    
5 7567     1-CW4DXI     bar         jair         ramat~ israel    8822     1-E3UILG    
6 8822     1-E3UILG     bar         jair         ramat~ israel    7567     1-CW4DXI    
7 11870    1-HC3YY6     kim         kevin        san f~ united s~ 6309     1-9CH29M    
8 12357    1-HUUEA6     lee         hyojin       daeje~ korea re~ 13460    1-IGKCPP    
9 13460    1-IGKCPP     lee         hyo jin      daeje~ korea re~ 12357    1-HUUEA6

I found another way to do it:

library(dplyr)    # arrange, filter, row_number, anti_join
library(magrittr) # for %<>%
# Keep rows whose left-side id also appears as a right-side id (the reciprocal pairs)
update <- three_letter_matches[!is.na(match(three_letter_matches$siterow_id.x, three_letter_matches$siterow_id.y)), ]
# Sort so the two rows of each pair are adjacent, then keep every other row (one half of each pair)
update %<>% arrange(last_name.x, first_name.x) %>%
  filter(row_number() %% 2 != 0)
# Drop that half from the original matches
three_letter_matches_update <- three_letter_matches %>%
  anti_join(update)

Still open to suggestions.

dhbrand

1 Answer


Not the easiest problem, but there are a few ways to do this. The first that comes to mind for me is a bit slow (because it uses rowwise(), which is equivalent to using map() or lapply()), but it works:

NOTE: This only works if siterow_id.x/y are character vectors. Won't work for factors.
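If the id columns come through as factors (depending on how the data was read in), converting them to character first should be enough. A minimal sketch, using the same column names:

three_letter_matches$siterow_id.x <- as.character(three_letter_matches$siterow_id.x)  # no-op if already character
three_letter_matches$siterow_id.y <- as.character(three_letter_matches$siterow_id.y)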

three_letter_matches <- three_letter_matches %>%
  rowwise() %>%                                # work on one row at a time
  mutate(both_values = paste0(sort(c(siterow_id.x, siterow_id.y)), collapse = ",")) %>%  # order-independent pair key
  ungroup() %>%                                # drop the rowwise grouping
  distinct(both_values, .keep_all = TRUE) %>%  # keep one row per pair
  select(-both_values)                         # remove the helper column


# pi_key.x   siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
# 6309  1-9CH29M    kim kevin   san f~  united s~   11870   1-HC3YY6
# 7198  1-CJGRSZ    kim jinseok seoul   korea re~   2952    1-2QBRZ2
# 7198  1-CJGRSZ    kim jinseok seoul   korea re~   2952    1-3AHHSU
# 7198  1-CJGRSZ    kim jinseok seoul   korea re~   2952    1-3JYF8V
# 7567  1-CW4DXI    bar jair    ramat~  israel      8822    1-E3UILG
# 12357 1-HUUEA6    lee hyojin  daeje~  korea re~   13460   1-IGKCPP

Basically what I'm doing here is using rowwise() so that I work on one row at a time. I then take the two siterow ids and sort them so that every row has them in the same order, and paste them together into a single string that is easy to compare for equivalence. Next I ungroup() so that you are looking at all rows again (getting rid of the rowwise grouping), then run distinct() to keep only the first row for each value of the new column, with the .keep_all option to keep all the other columns. Finally I clean up by removing the extra column.
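If speed becomes an issue on the larger dataset, the same sorted-pair idea can be vectorized without rowwise(). A sketch along the same lines, assuming the id columns are character: pmin()/pmax() give the element-wise smaller/larger of the two id strings, so both rows of a reciprocal pair get the same (id_lo, id_hi) key.

library(dplyr)

three_letter_matches <- three_letter_matches %>%
  mutate(id_lo = pmin(siterow_id.x, siterow_id.y),    # element-wise "smaller" id
         id_hi = pmax(siterow_id.x, siterow_id.y)) %>% # element-wise "larger" id
  distinct(id_lo, id_hi, .keep_all = TRUE) %>%        # keep one row per unordered pair
  select(-id_lo, -id_hi)                              # drop the helper columns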

Adam Sampson