Calculate minimal string distance and find row which minimizes distance

Question

I have a data frame with a column of class 'character'. I am trying to (a) create a new variable in some way summarizing how similar the value of a row in that column is to the most similar other value in the column and (b) identify the row of the most similar available value in that column for a given value in the column.

My existing approach is to calculate an edit distance measure using the stringdist package (https://cran.r-project.org/web/packages/stringdist/stringdist.pdf). However, this seems to be incredibly computationally demanding: after hours of waiting it still has not finished. It is also not clear how to find the smallest distance for each observation, that is, the distance between a given value and its closest match among the other values in the same vector. Furthermore, it doesn't appear to return the index of the most similar value.

Is there any somewhat computationally tractable way to develop a minimal distance measure for each observation and the comparison row for which the distance is minimized?

# Create data
data.frame(x = c("a","abbb","aa", "abbbkdjsfjldkfjldfkjl"))

# Want something like
data.frame(smallest_distance = c(1,20,1,90), closest_match = c(3,3,1,2))

Note: agrep seems to do this but (a) doesn't provide the distance, (b) it's not clear how to produce a tidy table of the data, and (c) it's not clear how to implement this for an entire column rather than looping through each value of the vector for the 'pattern' input: https://stat.ethz.ch/R-manual/R-devel/library/base/html/agrep.html — socialscientist, Jul 28 '17 at 07:00
The fact that it is computationally expensive is not surprising, since the cost grows as N^2 (you have to compare every pair). So, for a vector of ~1k elements, expect ~1M distance calculations. Things get rough very quickly. However, `stringdistmatrix` gives the distance for every pair, and you can then call `max.col` on the negated distance matrix (ignoring the diagonal, where each value is compared with itself) to get the index of the shortest distance in each row. — nicola, Jul 28 '17 at 07:10
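
The approach in the comment above can be sketched roughly as follows. This is a minimal sketch, not a tested solution for large data: it swaps in base R's `adist` (generalized Levenshtein distance, from the utils package) so that it runs without extra packages, and it uses `which.min` instead of `max.col` to pick the closest row. The name `df` is made up for illustration, and `stringdist::stringdistmatrix(df$x, df$x)` could presumably be substituted for `adist(df$x)`.

```r
# Example data from the question (df is a hypothetical name)
df <- data.frame(x = c("a", "abbb", "aa", "abbbkdjsfjldkfjldfkjl"),
                 stringsAsFactors = FALSE)

# Full N x N matrix of pairwise Levenshtein distances.
# Assumption: base R's adist() stands in for stringdistmatrix() here.
d <- adist(df$x)

# A value is always at distance 0 from itself, so mask the diagonal
# to keep each row from matching itself.
diag(d) <- Inf

# Smallest distance to any other value, per row
df$smallest_distance <- apply(d, 1, min)

# Row index of the closest other value (first one wins on ties)
df$closest_match <- apply(d, 1, which.min)

print(df)
```

With plain Levenshtein distance the four example strings give smallest distances of 1, 3, 1 and 17 and closest rows 3, 1, 1 and 2, so the numbers differ from the illustrative ones in the question. The full matrix still needs N^2 comparisons and N^2 memory, so this only addresses how to extract the minimum and its index, not the running time.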
