
I'm working with a dataset (`df`) which contains a column called `job`, where people just enter their job position.

The problem is that the data is typed manually, so it contains a lot of misspellings. To do some calculations grouped by job, I'm trying to create a column called `group` that puts jobs with similar strings together. For example:

Job         Jobgroup
----------  ---------
Bartender   Bartender
Barttender  Bartender
Batendere   Bartender
Engineer    Engineer
Enginer     Engineer
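
For reference, a quick check (with the stringdist package I use below) that the jw distances between these example pairs really are small:

library(stringdist)

# jw distances between the misspellings and their intended targets;
# these all come out well under the 0.10 threshold used below
stringdist(c("Barttender", "Batendere", "Enginer"),
           c("Bartender",  "Bartender", "Engineer"),
           method = "jw")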

The jobgroup will be created based on a string distance method (specifically the "jw" Jaro-Winkler method). I tried two approaches which give me roughly the desired results. The first is running a loop as follows:

library(stringdist)

df$group <- "nogroup"                     # start with every row ungrouped

for (i in seq_len(nrow(df))) {
  if (df$group[i] == "nogroup") {
    df$group[i] <- df$job[i]              # row i seeds a new group
  }
  for (j in i:nrow(df)) {
    if (df$group[j] == "nogroup") {
      if (stringdist(df$job[i], df$job[j], method = "jw") < 0.10) {
        df$group[j] <- df$group[i]
      }
    }
  }
}

The second is hierarchical clustering on string distances with the hclust() function. Its first step is to create a distance matrix, which won't work with 1.8 million rows. Since my dataset contains around 1.8 million rows, neither of the two approaches above finishes even in hours.
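
For context, a minimal sketch of what this second approach looks like on a small sample (the full pairwise distance matrix is exactly what becomes infeasible at scale):

library(stringdist)

jobs <- c("Bartender", "Barttender", "Batendere", "Engineer", "Enginer")

d   <- stringdistmatrix(jobs, method = "jw", useNames = "strings")  # a 'dist' object
hc  <- hclust(d, method = "average")
grp <- cutree(hc, h = 0.10)   # cut the tree at the same 0.10 threshold
split(jobs, grp)              # inspect the resulting groups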

So I'm here looking for any ideas, suggestions, or experience that could help me.

  • Could you add some data to test? For example `dput(head(df))`. – TarJae Dec 11 '22 at 16:29
  • Oh, I don't know why my example table got the wrong format; I will edit it later with an example. – Tung Anh Dec 11 '22 at 22:32
  • A couple of options: Vectorize the inner loop. Use a package to correct the misspellings, for example hunspell (see the first sketch after these comments). Perform the matching on unique values only. Assume the first few letters are spelled correctly and then use a divide-and-conquer algorithm for the matches. – Dave2e Dec 12 '22 at 03:10
  • Bit late to this, but the size of the dataset is not the issue; it's the number of unique entries in "Job", isn't it? This is presumably much smaller than the size of the dataset. What is the length of the vector of unique entries in Job? Cluster that, then use a key in data.table to assign the recode. I have this same problem on a huge scale as well and would be happy to elaborate with more information. In brief, you want the pairwise distances treated as a graph, which you can cluster using Louvain or another community-detection method in the igraph package (see the second sketch after these comments). – Brooks Ambrose Jun 04 '23 at 05:06
  • @Dave2e, FWIW, Jaro-Winkler distance already discounts the first few characters, by inflating the distance if the first few letters are wrong. – Brooks Ambrose Jun 04 '23 at 05:18
  • @BrooksAmbrose Thanks for your comment, I will take a look at the igraph package. In fact I had a temporary solution for this question: using filters to subset the dataset into multiple small ones, which the loop could run through without saturating memory. I used the first character to classify them, then within big groups I used the number of words to group further. For information, the difficulty in this task isn't only fixing misspelled labels but also grouping job titles written differently for men and women (in French). The final program takes around 2 hours to run, but it works quite well. – Tung Anh Jun 05 '23 at 08:06
  • @BrooksAmbrose I forgot to clarify: the dataset size here already reflects unique values, as I took it from a bigger dataset. – Tung Anh Jun 05 '23 at 08:17
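
A minimal sketch of the hunspell suggestion from the comments, assuming single-word titles and an English dictionary (both are assumptions; real job fields are often multi-word and, in this case, French):

library(hunspell)

jobs  <- unique(df$job)                # match on unique values only
fixed <- jobs
bad   <- !hunspell_check(jobs)         # flag titles the dictionary rejects
sug   <- hunspell_suggest(jobs[bad])   # list of suggestions per flagged title
fixed[bad] <- mapply(function(orig, s) if (length(s)) s[1] else orig,
                     jobs[bad], sug)   # take the first suggestion, else keep as-is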
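
And a rough sketch of the graph/community-detection idea, under the assumption that a pairwise matrix over the *unique* titles fits in memory:

library(stringdist)
library(igraph)

jobs <- unique(df$job)
d    <- stringdistmatrix(jobs, jobs, method = "jw")  # pairwise over unique titles
adj  <- (d < 0.10) * 1L                              # edge = pair under the threshold
diag(adj) <- 0L

g   <- graph_from_adjacency_matrix(adj, mode = "undirected")
cl  <- cluster_louvain(g)
key <- data.frame(job = jobs, group = membership(cl))  # join back onto df by job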

1 Answer


Comparing each job position with every other job position would be too slow without parallelization or optimized software such as Elasticsearch.

Maybe you could try one of the three following approaches:

  1. If the number of groups is fewer than about 100, you can define the group labels by hand and compute the distance between each group label and each job position (first sketch below).

  2. Since the job positions more or less form clusters in a space (the same assumption behind your choice of hclust), you can count the occurrences of each letter in each job position and compare those counts to get an approximation of the groups, which may be accurate enough (second sketch below).

  3. Mixing the first two: start by defining one or two job positions, compute the distance between these seeds and every job position, and collect the other group members. By repeatedly defining new groups for the still-unassigned job positions, you can find the groups iteratively (third sketch below).
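
A minimal sketch of the first approach, with two hand-defined labels standing in for the real list:

library(stringdist)

groups <- c("Bartender", "Engineer")   # defined by hand

# n x 2 matrix of jw distances, one column per hand-defined label
d <- stringdistmatrix(df$job, groups, method = "jw")
df$group <- groups[apply(d, 1, which.min)]   # assign each row its nearest label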
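
A sketch of the second idea, representing each title by its letter counts and clustering those numeric vectors (k-means here; the number of centers is a guess, and there must be more unique titles than centers):

jobs <- unique(df$job)

# one row per title, one column per letter: counts of each letter a-z
counts <- t(sapply(strsplit(tolower(jobs), ""),
                   function(chs) table(factor(chs, levels = letters))))

km  <- kmeans(counts, centers = 50)    # 50 clusters is an arbitrary choice
key <- data.frame(job = jobs, group = km$cluster)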
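
And a sketch of the third, iterative idea: each pass picks an ungrouped title as a seed and does one linear scan, avoiding the full quadratic comparison:

library(stringdist)

jobs  <- unique(df$job)
group <- rep(NA_character_, length(jobs))

while (any(is.na(group))) {
  seed <- jobs[which(is.na(group))[1]]   # next ungrouped title seeds a group
  hit  <- is.na(group) & stringdist(jobs, seed, method = "jw") < 0.10
  group[hit] <- seed                     # one linear pass per seed
}

key <- data.frame(job = jobs, group = group)   # merge back onto df by job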

  • Thanks for your comment. In my case, the first and second approaches might not be effective, but the third one is. I will try it later. Concerning Elasticsearch, is there a similar package in R which allows me to run tasks in parallel? After posting this question, I found two other packages named foreach and future.apply; I haven't tested them, but if you have, I would love to hear your opinion. – Tung Anh Dec 11 '22 at 22:40
  • 1
    I think parallelization is not necessary since `library(stringdist) data <- data.frame(text=rep("Test",18000000)) test <- stringdist(data$text,"Hallo","jw")` took only some seconds. The problem by comparing each job position with each other is a quadratic problem. Comparing with a single job title is a linear problem an though fast. – celimi Dec 12 '22 at 19:08