I am trying to find data entry errors in the names and locations of my dataset by fuzzy matching. I am have a unique key from the original data, siterow_id, and have made a new key, pi_key, where I already identified some hard matches. (no fuzzy matching). After running the fuzzy matching I get duplicate values. The matches from both the left and right side of the join for some of the siterow_id's. I can manually look at the data and see where this occurs and hard code to remove the rows. I want a more algorithmic way of doing this as I go to a larger dataset with many more matches.
I tried doing it this way but it removes the matches on the left and the right. If possible I would love a tidyverse way to do this and not a loop.
The table output is included below. You can see a duplicate in row 8 and 9.
for(site in three_letter_matches$siterow_id.x){
if (any(three_letter_matches$siterow_id.y == site)) {
three_letter_matches <- three_letter_matches[!three_letter_matches$siterow_id.y == site,]
}
}
pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
2 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
3 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
4 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
5 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
6 8822 1-E3UILG bar jair ramat~ israel 7567 1-CW4DXI
7 11870 1-HC3YY6 kim kevin san f~ united s~ 6309 1-9CH29M
8 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
9 13460 1-IGKCPP lee hyo jin daeje~ korea re~ 12357 1-HUUEA6
I found another way to do it
update <- three_letter_matches[!is.na(match(three_letter_matches$siterow_id.x, three_letter_matches$siterow_id.y)),]
update %<>% arrange(last_name.x, first_name.x) %>%
filter(row_number() %% 2 != 0)
three_letter_matches_update <- three_letter_matches %>%
anti_join(update)
Still open to suggestions.