Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
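As a rough illustration of the edit-based and q-gram-based families named above, here is a minimal plain-Python sketch. It is not the package's own code or API: the function names `levenshtein`, `qgram_distance` and `jaccard_distance` are made up for this example, and q = 2 is an arbitrary choice.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit-based: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 2) -> Counter:
    """Count every substring of length q (hypothetical helper)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """q-gram based: sum of absolute differences between q-gram counts."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ga[g] - gb[g]) for g in ga.keys() | gb.keys())


def jaccard_distance(a: str, b: str, q: int = 2) -> float:
    """q-gram based: 1 - |shared q-grams| / |all q-grams|, on q-gram sets."""
    sa, sb = set(qgrams(a, q)), set(qgrams(b, q))
    if not sa and not sb:
        return 0.0  # two strings without any q-grams are treated as identical
    return 1 - len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(levenshtein("abc", "abcd"))     # 1
    print(qgram_distance("abc", "abcd"))  # 1
    print(jaccard_distance("abc", "abcd"))
```

The edit-based distance counts character operations, while the q-gram distances compare which short substrings the two strings share, so the two families can rank the same pair of strings quite differently.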

162 questions

votes

1 answer

Best similarity distance metric for two strings

I have a bunch of company names to match. For example, I want to match this string: A&A PRECISION with A&A PRECISION ENGINEERING. However, almost every similarity measure I use, like Hamming distance, Levenshtein distance, Restricted…

asked Nov 09 '19 at 17:03

imguessing

votes

0 answers

Postgres Join based on Levenshtein distance

I am trying to make use of the Levenshtein distance in my join condition. Since SQLAlchemy doesn't provide an implementation within its func module, I set the method stringdist.rdlevenshtein_norm to func.rdlevenshtein_norm and used it in my…

python postgresql sqlalchemy levenshtein-distance stringdist

asked Jul 26 '19 at 17:54

Vinay

votes

1 answer

R - String Distance with weighted words

Is there any way to weight specific words using the stringdist package or another string distance package? Often I have strings that share a common word such as "city" or "university" that get relatively close string distance matches as a result,…

r stringdist

asked May 24 '18 at 18:48

jzadra

votes

3 answers

Calculating string similarity as a percentage

The given function uses the "stringdist" package in R and tells the minimum number of changes needed to turn one string into another. I want to find out how similar one string is to another as a percentage. Please help, and thanks. stringdist("abc","abcd",…

r stringdist

asked Sep 27 '17 at 11:14

Ashmin Kaul

votes

1 answer

stringdist_join results in NAs

I am experimenting with the stringdist package in order to make fuzzy joins, and I have run into a problem which I do not understand and cannot find an answer for. I want to join these 2 data tables with the "dl" method and it produces an NA, which I…

r matching fuzzy stringdist fuzzyjoin

asked Sep 21 '17 at 14:41

Dome

votes

1 answer

Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

I was answering these two questions and got an adequate solution, but I had trouble passing arguments using fuzzy_join into the match_fun that I extracted from fuzzyjoin::stringdist_join. In this case, I'm using a mix of multiple match_fun's,…

r arguments parameter-passing stringdist fuzzyjoin

asked Jun 06 '17 at 07:10

Arthur Yip

votes

4 answers

stringdist on one vector

I'm trying to use stringdist to identify all strings with a max distance of 1 in the same vector, and then publish the match. Here is a sample of the data: Starting data frame: a = c("tom", "tomm", "alex", "alexi", "chris", "jen", "jenn",…

r stringdist

asked Jan 10 '17 at 02:38

richiepop2

votes

1 answer

Jaccard similarity in stringdist package to match words in character string

I would like to use the Jaccard similarity in the stringdist function to determine the similarity of bags of words. From what I can tell, using Jaccard only matches by letters within a character string. c <- c('cat', 'dog', 'person') d <-…

r text stringdist

asked May 10 '16 at 16:16

matsuo_basho

votes

1 answer

Levenshtein implementation capable of working with large strings and vectors

There is a package named stringdist in R which contains functions for computing Levenshtein string distance. I have two problems with this package: 1st, it does not work with large strings, e.g.: set.seed(1) a.str <- paste(sample(0:9, 100000, replace =…

python r perl levenshtein-distance stringdist

asked Apr 26 '16 at 11:33

Wakan Tanka

votes

1 answer

r stringdist or levenshtein.distance to replace strings

I have a large dataset with ~one million observations, keyed with a defined observation type. Within the dataset, there are ~900,000 observations with malformed observation types, with ~850 (incorrect) variations of the 50 acceptable observation…

regex r gsub levenshtein-distance stringdist

asked Oct 22 '15 at 15:05

Andrew M

votes

1 answer

Word-level edit distance between two sentences in R

I am looking for a fast solution in R for determining the word-level edit distance between two sentences. More specifically, I want to determine the minimal number of additions, substitutions or deletions of words needed to transform sentence A into sentence B. For…

r data-mining text-mining stringdist

asked Mar 05 '15 at 11:56

JackONeill

votes

2 answers

Convert to matrix but keep one diagonal to NULL in R

I have a huge dataset that looks like this. To save some memory I want to calculate the pairwise distance but leave the upper diagonal of the matrix as NULL. library(tidyverse) library(stringdist) #> #> Attaching package: 'stringdist' #> The…

r matrix dplyr tidyverse stringdist

asked Mar 01 '22 at 09:00

LDT

votes

1 answer

Extract strings based on multiple patterns

I have thousands of DNA sequences that look like this :). ref <- c("CCTACGGTTATGTACGATTAAAGAAGATCGTCAGTC", "CCTACGCGTTGATATTTTGCATGCTTACTCCCAGTC", "CCTCGCGTTGATATTTTGCATGCTTACTCCCAGTC") I need to extract every sequence between the CTACG …

r gsub stringr stringdist

asked Dec 18 '21 at 13:58

LDT

votes

0 answers

Subset dataframe and batch process through existing stringdist function in parallel in R

I have inherited a function to run a fuzzy match between two sets of names using the stringdist package to calculate the distance between two string variables and select the match with the smallest distance. This is fine and wonderful and works…

r parallel-processing subset stringdist

asked Feb 03 '21 at 04:34

Bort Edwards

votes

1 answer

Replace string with most frequent fuzzy match

I have a dataframe of unstructured names, and I want to create a 'master' list of the cleaned name in one column with all the variants in another column. I am using the stringdist package. Below is a small example: library(dplyr) # for pipes…

r stringdist

asked Feb 05 '20 at 18:25

Francisco
