
I have a large dataset (300k rows) and there is a particular column I want to clean up: the names column. The problem is that the same name can have multiple unintentional spellings due to human input error, for example "Benjamin" spelled as "Banjamin".

I tried OpenRefine, but both n-gram merge and key collision failed to resolve the above example, while the Levenshtein distance method runs forever and freezes up my computer. I tried Python, but fuzzywuzzy in a for loop again takes forever and doesn't seem feasible. RStudio seemed like a great choice, but I'm not sure how to approach this. I tried the package 'refinr', but that is the same as OpenRefine and cannot handle "Benjamin"/"Banjamin" type misspellings; it is only good when a letter is repeated or there's a space between the words.

Anyone have any suggestions as to what I should do?

Thanks

Ben
  • Welcome to SO! Please take some time to read up on how to [ask a good question](https://stackoverflow.com/help/how-to-ask). I think a Google search for something like "fuzzy join in R" should give you a number of options. You will probably find the [fuzzyjoin](https://cran.r-project.org/web/packages/fuzzyjoin/) package particularly useful. –  Jun 14 '20 at 11:00
  • Do you have a list of the unintentional/incorrect spellings? TMM, the title should be "fuzzy matching" – Chris Ruehlemann Jun 14 '20 at 13:17
  • You are probably looking for something like the [stringdist](https://cran.r-project.org/web/packages/stringdist/) package that has functions for approximate string matching and distance. But given how extremely broad your question is, it isn't easy to give you a precise answer. All we can do is give you a few directions but you are going to have to figure the details out yourself. – Jan Jun 14 '20 at 13:34
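
Following up on the stringdist suggestion above, here is a minimal sketch assuming you can supply (or derive, e.g. from the most frequent variants) a list of canonical spellings; the names and the maxDist value are illustrative only:

library(stringdist)

# Hypothetical canonical spellings (assumption: you can build this list yourself)
canonical <- c("Benjamin", "James", "David")

# Raw names containing human-input misspellings
raw <- c("Banjamin", "Benjamin", "Jemes", "Dvaid")

# amatch() returns the index of the closest canonical name within maxDist edits
idx <- amatch(raw, canonical, method = "osa", maxDist = 2)
cleaned <- ifelse(is.na(idx), raw, canonical[idx])
cleaned
# "Benjamin" "Benjamin" "James"    "David"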

2 Answers


I think the fuzzyjoin package will meet your needs.

df1 <- data.frame(name = c("James", "Benjamin", "Bert"), id = 1:3)
df2 <- data.frame(name = c("Banjamin", "David", "Jemes"), id = 4:6)

library(fuzzyjoin)

# Inner join on approximate string distance
# (defaults: optimal string alignment, max_dist = 2)
stringdist_join(df1, df2, by = "name")

#     name.x id.x   name.y id.y
# 1    James    1    Jemes    6
# 2 Benjamin    2 Banjamin    4
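
By default, stringdist_join() performs an inner join using optimal string alignment distance with max_dist = 2. As a rough sketch (the argument values below are illustrative, not a recommendation), you can tune how aggressively names are merged by setting those arguments explicitly:

stringdist_join(df1, df2, by = "name",
                method = "osa",        # optimal string alignment distance
                max_dist = 1,          # allow at most one edit
                mode = "inner",
                distance_col = "dist") # keep the computed distance as a column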
Wimpel

Complete self-promotion, but I have written an R package, zoomerjoin, which uses MinHashing to fuzzily join large datasets without having to compare all pairs of rows between the two dataframes. This means you can merge moderately sized (millions of rows) dataframes in seconds or minutes on a modern data-science laptop without running out of memory.

The example you give will still run slowly, as there are a lot of "correct" matches to check (each row in df_1 will match thousands of rows in df_2), but in most use cases the package will let you fuzzily match dataframes with millions of rows.

Here's how I would use the package to join two smaller versions of the example dataframes you have provided:

library(tidyverse)
library(fuzzyjoin)

# Must have the Rust compiler installed to install this package;
# see https://github.com/beniaminogreen/zoomerjoin for more details
devtools::install_github("beniaminogreen/zoomerjoin")
library(zoomerjoin)

n <- 500

# First dataframe: ids and names drawn uniformly from five options
id_1 <- 1:n
names_1 <- sample(c("hannah", "marcus", "fred", "joe", "lara"), n, replace = TRUE)
df_1 <- data.frame(id_1, names_1)
df_1$numberFound <- NA

# Second dataframe: overlapping / misspelled names plus an extra column
id_2 <- 1:n
names_2 <- sample(c("hannah", "markus", "paul", "mary", "olivia"), n, replace = TRUE)
anyNumber <- sample(c(123, 234, 345, 456, 567), n, replace = TRUE)
df_2 <- data.frame(id_2, names_2, anyNumber)

# Fuzzy inner join on the name columns using locality-sensitive hashing
joined_df_s <- lsh_inner_join(df_1, df_2, by = c("names_1" = "names_2"),
                              n_gram_width = 1, threshold = 0.5)
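
The threshold argument is the minimum Jaccard similarity between the character n-gram sets of two names for them to count as a match. As a sketch on the same hypothetical dataframes, lowering it makes the matching more permissive:

# Lower threshold -> more typos caught, but also more false positives
joined_loose <- lsh_inner_join(df_1, df_2, by = c("names_1" = "names_2"),
                               n_gram_width = 1, threshold = 0.3)
nrow(joined_loose)  # expect at least as many rows as with threshold = 0.5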

The package is still in development (although its outputs have been tested against fuzzyjoin's for accuracy), so please don't hesitate to file an issue on GitHub if there are areas of the documentation that are unclear or if you find any bugs.