Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences.
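
As a rough illustration of the families of measures listed above, here is a minimal sketch. It assumes the stringdist package is installed from CRAN; the example strings are made up for illustration and do not come from any question on this page.

```r
# Assumption: the package is installed, e.g. via install.packages("stringdist")
library(stringdist)

# Edit-based: Levenshtein counts insertions, deletions and substitutions.
d_lv <- stringdist("abc", "abcd", method = "lv")  # one insertion

# q-gram based: Jaccard distance over character bigrams (q = 2).
d_jac <- stringdist("night", "nacht", method = "jaccard", q = 2)

# Heuristic: method "jw" gives Jaro when p = 0 and Jaro-Winkler when p > 0.
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate counterpart of match(): position of the closest entry in the
# lookup table within maxDist, or NA if nothing is close enough.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)

# Soundex codes for a character vector.
codes <- phonetic(c("Robert", "Rupert"), method = "soundex")
```

The `method` argument selects the distance; `q` only matters for the q-gram based methods and `p` only for "jw".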

162 questions
3
votes
1 answer

Best similarity distance metric for two strings

I have a bunch of company names to match, for example, I want to match this string: A&A PRECISION with A&A PRECISION ENGINEERING However, almost every similarity measure I use: like Hamming distance, Levenshtein distance, Restricted…
imguessing
  • 377
  • 1
  • 3
  • 9
3
votes
0 answers

Postgres Join based on Levenshtein distance

I am trying to make use of the Levenshtein distance in my join condition. Since sqlalchemy doesn't provide an implementation within the func module, I set the method stringdist.rdlevenshtein_norm to func.rdlevenshtein_norm and used it in my…
Vinay
  • 952
  • 2
  • 10
  • 27
3
votes
1 answer

R - String Distance with weighted words

Is there any way to weight specific words using the stringdist package or another string distance package? Often I have strings that share a common word such as "city" or "university" that get relatively close string distance matches as a result,…
jzadra
  • 4,012
  • 2
  • 26
  • 46
3
votes
3 answers

Calculating string similarity as a percentage

The given function uses the "stringdist" package in R and returns the minimum number of changes needed to turn one string into another. I want to find out how similar one string is to another as a percentage. Please help me and thanks. stringdist("abc","abcd",…
Ashmin Kaul
  • 860
  • 2
  • 12
  • 37
3
votes
1 answer

stringdist_join results in NAs

I am experimenting with the stringdist package in order to make fuzzy joins, and I ran into a problem which I do not understand and cannot find an answer for. I want to join these 2 data tables with the "dl" method and it produces an NA, which I…
Dome
  • 60
  • 6
3
votes
1 answer

Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

I was answering these two questions and got an adequate solution, but I had trouble passing arguments using fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case, I'm using a mix of multiple match_fun's,…
Arthur Yip
  • 5,810
  • 2
  • 31
  • 50
3
votes
4 answers

stringdist on one vector

I'm trying to use stringdist to identify all strings with a max distance of 1 in the same vector, and then publish the match. Here is a sample of the data: Starting data frame: a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn",…
richiepop2
  • 348
  • 1
  • 12
3
votes
1 answer

Jaccard similarity in stringdist package to match words in character string

I would like to use the Jaccard similarity in the stringdist function to determine the similarity of bags of words. From what I can tell, using Jaccard only matches by letters within a character string. c <- c('cat', 'dog', 'person') d <-…
matsuo_basho
  • 2,833
  • 8
  • 26
  • 47
3
votes
1 answer

Levenshtein implementation capable working with large strings and vectors

There is a package named stringdist in R which contains functions for computing Levenshtein string distance. I have two problems with this package: 1st, it does not work with large strings, e.g.: set.seed(1) a.str <- paste(sample(0:9, 100000, replace =…
Wakan Tanka
  • 7,542
  • 16
  • 69
  • 122
3
votes
1 answer

r stringdist or levenshtein.distance to replace strings

I have a large dataset with ~one million observations, keyed with a defined observation type. Within the dataset, there are ~900,000 observations with malformed observation types, with ~850 (incorrect) variations of the 50 acceptable observation…
Andrew M
  • 101
  • 9
3
votes
1 answer

Word-level edit distance between two sentences in R

I am looking for a fast solution in R for determining the word-level edit distance between two sentences. More specifically, I want to determine the minimal number of additions, substitutions or deletions of words needed to transform sentence A into sentence B. For…
JackONeill
  • 123
  • 1
  • 9
2
votes
2 answers

Convert to matrix but keep one diagonal to NULL in R

I have a huge dataset that looks like this. To save some memory I want to calculate the pairwise distance but leave the upper diagonal of the matrix as NULL. library(tidyverse) library(stringdist) #> #> Attaching package: 'stringdist' #> The…
LDT
  • 2,856
  • 2
  • 15
  • 32
2
votes
1 answer

Extract strings based on multiple patterns

I have thousands of DNA sequences that look like this :). ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC") I need to extract every sequence between the CTACG …
LDT
  • 2,856
  • 2
  • 15
  • 32
2
votes
0 answers

Subset dataframe and batch process through existing stringdist function in parallel in R

I have inherited a function to run a fuzzy match between two sets of names using the stringdist package to calculate the distance between two string variables and select the match with the smallest distance. This is fine and wonderful and works…
2
votes
1 answer

Replace string with most frequent fuzzy match

I have a dataframe of unstructured names, and I want to create a 'master' list of the cleaned name in one column with all the variants in another column. I am using the stringdist package. Below is a small example: library(dplyr) # for pipes…
Francisco
  • 169
  • 1
  • 9