I'm working with a dataset (df) that contains a column called job, where people enter their job position.
The problem is that the data is typed manually, so it contains a lot of misspellings. To do some calculations grouped by job, I'm trying to create a column called group that puts jobs with similar strings together. For example:
Job | Jobgroup |
---|---|
Bartender | Bartender |
Barttender | Bartender |
Batendere | Bartender |
Engineer | Engineer |
Enginer | Engineer |
The job group will be created based on a string distance method (specifically the Jaro-Winkler method, "jw"). I tried two approaches, both of which give me more or less the results I want. The first is running a loop as follows:
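library(stringdist)
# Assumes df$group was initialised to "nogroup" for every row.
for (i in seq_len(nrow(df))) {
  # Assumption: a row that is still ungrouped starts a new group named after its own job.
  if (df$group[i] == "nogroup") df$group[i] <- df$job[i]
  for (j in i:nrow(df)) {
    # Only rows without a group yet can join the group of row i.
    if (df$group[j] == "nogroup" &&
        stringdist(df$job[i], df$job[j], method = "jw") < 0.10) {
      df$group[j] <- df$group[i]
    }
  }
}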
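To make the threshold concrete, here is a small check of the distances the loop relies on, using the example values from the table above. The vector name `candidates` is only for illustration, and 0.10 is the same cutoff as in the loop.

```r
library(stringdist)

# Hypothetical candidate spellings taken from the example table
candidates <- c("Barttender", "Batendere", "Engineer")

# Jaro-Winkler distance from the reference spelling to each candidate
d <- stringdist("Bartender", candidates, method = "jw")

# TRUE where the candidate would be put in the same group as "Bartender"
same_group <- d < 0.10
print(data.frame(candidates, d, same_group))
```

The two misspellings fall under the cutoff while "Engineer" does not, which is the behaviour I want from the grouping.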
The second is hierarchical clustering on the string distances with the hclust() function. The first step of this approach is to create a distance matrix, which won't work with 1.8 million rows. The problem is that my dataset contains around 1.8 million rows, so neither of the two approaches above finishes even after hours.
So I'm looking for any ideas, suggestions or experience that could help me.
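For reference, a minimal sketch of the second approach on a toy vector. The `jobs` vector, the complete linkage and the cut height of 0.2 are assumptions for illustration only, not tuned values.

```r
library(stringdist)

# Hypothetical toy data standing in for df$job
jobs <- c("Bartender", "Barttender", "Batendere", "Engineer", "Enginer")

# Full pairwise Jaro-Winkler distance matrix; its size grows as n^2,
# which is what makes this step infeasible on 1.8 million rows
d <- as.dist(stringdistmatrix(jobs, jobs, method = "jw"))

# Hierarchical clustering, then cut the tree at an assumed height
hc <- hclust(d, method = "complete")
grp <- cutree(hc, h = 0.2)

print(data.frame(job = jobs, group = grp))
```

On this toy input the three bartender spellings end up in one cluster and the two engineer spellings in another, but the matrix for the real data would need on the order of 10^12 entries.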