String distance matrix by criteria

Question

I have written a script to do some fuzzy matching of company names. I'm matching a number of not-always-completely-correct company names (i.e. there might be small spelling mistakes or the "inc." suffix is missing) up against a corpus of "correct" company names and ID. Obviously the point is to correctly attach ID's to the not-always-correct company names.

Here are some grossly simplified version of the datasets I'm matching (I'm not using the zip-part yet, but will get back to it later):

df <- data.frame(zip = c("4760","5445", "2200"), company = c("company x", "company y", "company z"))
corpus <- data.frame(zip = c("4760","5445", "2200", "2200", "2200"), company = c("company x inc.", "company y inc.", "company z inc.", "company a inc.", "company b inc."), id = c(12121212, 23232323, 34343434, 56565656, 67676767))

df
   zip   company
1 4760 company x
2 5445 company y
3 2200 company z

corpus
   zip        company       id
1 4760 company x inc. 12121212
2 5445 company y inc. 23232323
3 2200 company z inc. 34343434
4 2200 company a inc. 56565656
5 2200 company b inc. 67676767

I then use the following piece of code to create a matrix of string distance

library(stringdist)
distance.method <- c("jw")

string.dist.matrix <- stringdistmatrix(tolower(corpus$company),
                                       tolower(df$company),
                                       method = distance.method,
                                       nthread = getOption("sd_num_thread"))

string.dist.matrix

          [,1]      [,2]      [,3]
[1,] 0.1190476 0.1798942 0.1798942
[2,] 0.1798942 0.1190476 0.1798942
[3,] 0.1798942 0.1798942 0.1190476
[4,] 0.1798942 0.1798942 0.1798942
[5,] 0.1798942 0.1798942 0.1798942

I then go ahead and match up the pairs of minimum distance. Normally I want to match maybe 4000 companies up a against a corpus of 4,5 mio. companies, which takes some computing power to say the least. I had the idea that instead of calculating string distance between all possible pairs, I would instead only calculate it for those who share a zip code. As I see it the result would be a way smaller amount of calculations and even more precision in the fuzzy matching for more complex cases than the ones I've illustrated here with my simplified data.

In short the resulting matrix I would want would be something like this:

     [,1]            [,2]              [,3]
[1,] 0.1190476       NA                NA
[2,] NA              0.1190476         NA
[3,] NA              NA                0.1190476
[4,] NA              NA                0.1798942
[5,] NA              NA                0.1798942

I just cant seem to figure out a way to do it. Any ideas?

WaltS · Accepted Answer · 2016-02-11T15:44:22.673

The approaches below use dplyr and starts with phiver's approach of joining the two data frames but then goes on to produce either a data frame similar to your string.dist.matrix or a data frame in a condensed "key value" form. I've added another company to your df data frame to include the case of multiple companies with the same df zip.

The distance matrix version is:

 df <- data.frame(zip = c("4760","5445", "2200","2200"), company = c("company x", "company y", "company z","company a"))
  corpus <- data.frame(zip = c("4760","5445", "2200", "2200", "2200"), company = c("company x inc.", "company y inc.", "company z inc.", "company a inc.", "company b inc."),
                       id = c(12121212, 23232323, 34343434, 56565656, 67676767))

    # large matrix version
    library(dplyr)
    dist_mat <- inner_join(corpus, df, by = "zip") %>%
      mutate(corpus_co=tolower(as.character(company.x)), df_co=tolower(as.character(company.y)), company.x=NULL, company.y=NULL) %>%
      group_by(zip) %>%
      do( { dist_df=data.frame(unique(.$corpus_co), 
                               stringdistmatrix(unique(.$corpus_co), unique(.$df_co), method=distance.method), stringsAsFactors=FALSE);
            colnames(dist_df) = c("corpus_co", unique(.$df_co));
            dist_df})

with the result

     zip      corpus_co company z company a company x company y
  (fctr)          (chr)     (dbl)     (dbl)     (dbl)     (dbl)
1   2200 company z inc. 0.1190476 0.1798942        NA        NA
2   2200 company a inc. 0.1798942 0.1190476        NA        NA
3   2200 company b inc. 0.1798942 0.1798942        NA        NA
4   4760 company x inc.        NA        NA 0.1190476        NA
5   5445 company y inc.        NA        NA        NA 0.1190476

However, with 4000 rows in your df matrix, the full string distance matrix is very large with many NA's. A more efficient version uses the gather function from the tidyr package to produce a result in the key value format. In this approach some variables form unique keys which then have associated values. The vignette for the tidyr package explains this in more detail. In your case the corpus company name and the df company name form the key and the string distance between their names is the value. This is done for each zip code so the full string distance matrix is never stored. You may also find this easier to work with for your subsequent analysis. The code differs from the previous version only in the last line.

library(tidyr)
dist_keyval <- inner_join(corpus, df, by = "zip") %>%
               mutate(corpus_co=tolower(as.character(company.x)), df_co=tolower(as.character(company.y)), company.x=NULL, company.y=NULL) %>%
               group_by(zip) %>%
               do( { dist_df=data.frame(unique(.$corpus_co), 
                               stringdistmatrix(unique(.$corpus_co), unique(.$df_co), method=distance.method), stringsAsFactors=FALSE);
                     colnames(dist_df) = c("corpus_co", unique(.$df_co));
                     gather(dist_df, key=df_co, value=str_dist, -corpus_co)})

which gives the result

    zip      corpus_co     df_co  str_dist
  (fctr)          (chr)     (chr)     (dbl)
1   2200 company z inc. company z 0.1190476
2   2200 company a inc. company z 0.1798942
3   2200 company b inc. company z 0.1798942
4   2200 company z inc. company a 0.1798942
5   2200 company a inc. company a 0.1190476
6   2200 company b inc. company a 0.1798942
7   4760 company x inc. company x 0.1190476
8   5445 company y inc. company y 0.1190476

Edited

The code to find the corpus_co which is the minimum distance from each df_co is:

 dist_min <- dist_keyval %>% group_by(zip, df_co) %>%
                slice(which.min(str_dist))

To add columns to the final result, you can join on the form of the company names which was used to do the string distance calculations (i.e. the lower case names) as follows:

final_result <- corpus %>% mutate(lower_co = tolower(as.character(company)))  %>%
            right_join(dist_min, by = c("zip", "lower_co" = "corpus_co") ) %>%
            select(c(df_co, company, id),  everything(), -lower_co)

which gives

      df_co        company       id  zip  str_dist
1 company a company a inc. 56565656 2200 0.1190476
2 company z company z inc. 34343434 2200 0.1190476
3 company x company x inc. 12121212 4760 0.1190476
4 company y company y inc. 23232323 5445 0.1190476

The last select shows how to re-arrange columns into a particular order.

Thanks for this to both you and Phiver. I'm still struggling to understand all of it, but I get the overall concept, and it works as it should. I've been trying to add more columns to the final result, but can't seem to figure out how to. I get that it's in that long chain that the last five columns are lost, but where exactly? — Morten Nielsen, Feb 10 '16 at 20:41

score 1 · Answer 2 · answered Feb 09 '16 at 20:20

I have some ideas. If you do not need your distance matrix, you could solve it like this. I used dplyr because I know that one better. You could split the code into pieces instead of one dplyr command. Or use data.table. That might even be faster.

Steps taken:

Join df and corpus with an inner join on zip. This removes all the unwanted records and you have company names next to each other.
calculate the distance between the company names
group by the original company
filter on minimum distance

These steps avoid the use of first creating a matrix and then looking for a minimum value or putting other values to NA.

library(stringdist)
library(dplyr)

df <- data.frame(zip = c("4760","5445", "2200"), company = c("company x", "company y", "company z"))
corpus <- data.frame(zip = c("4760","5445", "2200", "2200", "2200"), company = c("company x inc.", "company y inc.", "company z inc.", "company a inc.", "company b inc."), id = c(12121212, 23232323, 34343434, 56565656, 67676767))


distance.method <- c("jw")

combined_min_distance <- inner_join(df, corpus, by = "zip" ) %>% 
  mutate(distance = stringdist(tolower(combined$company.x),
                    tolower(combined$company.y),
                    method = distance.method,
                    nthread = getOption("sd_num_thread"))) %>% 
  group_by(company.x) %>% 
  filter(distance == min(distance))

combined_min_distance

     zip company.x      company.y       id  distance
  (fctr)    (fctr)         (fctr)    (dbl)     (dbl)
1   2200 company z company z inc. 34343434 0.1190476
2   4760 company x company x inc. 12121212 0.1190476
3   5445 company y company y inc. 23232323 0.1190476

I needed to remove "combined" from "combined$company.x" and "combined$company.y) — dca, Jan 19 '19 at 04:41

score 1 · Answer 3 · answered Feb 26 '16 at 17:09

You can use stringdist::amatch and avoid computing the full stringdist matrix.

df <- data.frame(zip = c("4760","5445", "2200"), company = c("company x", "company y", "company z"))
corpus <- data.frame(zip = c("4760","5445", "2200", "2200", "2200"), company = c("company x inc.", "company y inc.", "company z inc.", "company a inc.", "company b inc."), id = c(12121212, 23232323, 34343434, 56565656, 67676767))


i <- stringdist::amatch(df$company,corpus$company,maxDist=5)
merged <- data.frame(df$company,corpus$company[i])
merged

> merged
  df.company corpus.company.i.
1  company x    company x inc.
2  company y    company y inc.
3  company z    company z inc.

It is even better to do some string cleaning before, so you know that the distances are only caused by actual typos (note the lower maxDist).

lookup <- gsub(" inc.$","",corpus$company)
i2 <- stringdist::amatch(df$company,lookup,maxDist=2)
merged2 <- data.frame(df$company,corpus$company[i2])

String distance matrix by criteria

3 Answers3