Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
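
A minimal sketch of the features described above, assuming the stringdist package is installed from CRAN. The example strings and variable names are made up for illustration, and only a few of the available methods are shown:

```r
library(stringdist)

# Edit-based distances: "lv" is Levenshtein, "osa" is optimal string alignment.
d_lv  <- stringdist("kitten", "sitting", method = "lv")   # 3 edits
d_osa <- stringdist("abc", "acb", method = "osa")          # 1: one adjacent transposition
d_lv2 <- stringdist("abc", "acb", method = "lv")           # 2: Levenshtein has no transposition

# q-gram based distance on character bigrams (q = 2).
d_qgram <- stringdist("perimetrico", "perimetrico peri", method = "qgram", q = 2)

# Heuristic metric: method = "jw" with p > 0 applies the Winkler prefix bonus.
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate counterpart of match(): index of the closest table entry
# within maxDist, or NA if nothing is close enough.
idx <- amatch("aple", c("pear", "apple", "banana"), method = "osa", maxDist = 1)  # 2

# Soundex code of a name.
code <- phonetic("Robert")  # "R163"
```

For all pairwise distances within or between vectors, stringdistmatrix() takes the same method argument.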

162 questions
1
vote
1 answer

Find matching groups of strings in R

I have a vector of about 8000 strings. Each element in the vector is a company name. My objective is to cluster these company names into groups, so that each cluster contains company names that are similar to each other (For…
Varun
1
vote
1 answer

Using dplyr::mutate to loop through all available methods in stringdist

I am doing some fuzzy text matching to match school names. Here is an example of my data, which is two columns in a tibble: data <- tibble(school1 = c("abilene christian", "abilene christian", "abilene christian", "abilene christian"), …
Jenna Allen
1
vote
1 answer

Displaying corresponding values in data frame in R

Please check the code below. I have created a data frame using three variables; the variable "y123" computes the similarity between columns a2 and a1. The variable "y123" gives me 16 values in total, where every a1 value gets compared with a2. My…
Ashmin Kaul
1
vote
0 answers

User defined match terms for string distance calculation in R

There are many string distance calculation methods in the R package {stringdist} (https://cran.r-project.org/web/packages/stringdist/stringdist.pdf). I am curious whether it is possible to include user-defined match items by using regex…
Anne
1
vote
1 answer

text mining with r library stringdist

I have the following algorithm prepared for matching two strings. library(stringdist) qgrams('perimetrico','perimetrico peri',q=2) pe ri tr er im me o et ic co p V1 1 2 1 1 1 1 0 1 1 1 0 V2 2 3 1 2 1 1 1 1 1 1 1 As far as Im…
lolo
1
vote
0 answers

Approximate String matching exclude first character

I'm trying to do approximate String matching between lists of terms terms1 and terms2 where I want to match Strings including typos, different notations, etc. I'm using amatch(terms1, terms2, method="osa", maxDist=1, nomatch=0) I want to match…
Alec
1
vote
1 answer

RecordLinkage - R one vector. Do not match to self

If I have one vector of names, say: a = c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell") I want to use levenshteinSim or similar to get a similarity score within this vector. However, I don't want it to score each name against itself.…
1
vote
1 answer

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio

I have a data.table dt with 3 columns: id, name as string, threshold as num. A sample is: dt <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6)) nid | rname | maxr n1 |…
user2590177
1
vote
0 answers

Calculating pairwise string distance for big data

I'm comparing pairwise string distances for 8 million observations on 17 columns. Because I run into memory issues, I want to ask for help on a sub-setting technique or other methods to overcome this issue. In a different question on this website,…
wake_wake
1
vote
1 answer

In R - fastest way pairwise comparing character strings on similarity

I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks? Say I have the following data.frame: df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), …
wake_wake
1
vote
2 answers

Maintaining headers in edit distance

I am running edit distance using stringdist. The output replaces the input with a numbered list instead of the actual string being compared. This is currently what I have: library(stringdist) a <- c("foo", "bar", "bear", "boat", method =…
El David
1
vote
1 answer

Reshaping and summarizing a data.frame based on partial match text (package stringdist)

I am working on an old list of names. The names of people are written differently, but in reality these are the same people. I used the stringdist package to compute the distance between strings to find which names are probably the same. A small example of…
Wilcar
1
vote
2 answers

R look for abbreviation in full string

I'm looking for an efficient way in R to tell if one string might be an abbreviation for another. The basic approach I'm taking is to see if the letters in the shorter string appear in the same order in the longer string. For example, if my shorter…
chtongueek
1
vote
1 answer

More efficient method for populating a matrix than nested for loops

Is there a more efficient way to achieve the following? library(dplyr) filers <- sapply(1:100, function(z) sample(letters, sample(1:20, 1), replace=T) %>% paste(collapse='')) %>% unlist() %>% unname() n <- length(unique(filers)) similarityMatrix <-…
tblznbits
1
vote
2 answers

How to create groups of like sounding names in R?

I'd like to create a group variables based upon how similar a selection of names is. I have started by using the stringdist package to generate a measure of distance. But I'm not sure how to use that output information to generate a group by…
Kath05