Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
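
The distance families listed above can be tried out with a few calls. This is a minimal sketch, assuming the stringdist package is installed from CRAN; the example strings and the variable names are illustrative only and do not come from any question on this page.

```r
# Assumption: stringdist is installed, e.g. via install.packages("stringdist")
library(stringdist)

# Edit-based distances
d_lv   <- stringdist("kitten", "sitting", method = "lv")  # Levenshtein
d_osa  <- stringdist("abc", "acb", method = "osa")        # adjacent swap counts as one edit
d_lv2  <- stringdist("abc", "acb", method = "lv")         # same pair without transpositions

# q-gram based distance, here with q = 2 (character bigrams)
d_qg   <- stringdist("abc", "abd", method = "qgram", q = 2)

# Heuristic metric (Jaro distance; values lie between 0 and 1)
d_jw   <- stringdist("martha", "marhta", method = "jw")

# Approximate version of match(): index of the closest entry in the lookup table
idx    <- amatch("leia", c("uhura", "leela"), maxDist = 5)

# Soundex codes for two similar-sounding names
codes  <- phonetic(c("Robert", "Rupert"), method = "soundex")
```

Here `amatch` returns the position of the closest table entry within `maxDist`, or `NA` when nothing is close enough, which mirrors how base `match` behaves for exact matches.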

162 questions

1 answer

Find matching groups of strings in R

I have a vector of about 8000 strings. Each element in the vector is a company name. My objective is to cluster these company names into groups, so that each cluster contains company names that are similar to each other (For…

asked Feb 26 '18 at 17:21

Varun

1 answer

Using dplyr::mutate to loop through all available methods in stringdist

I am doing some fuzzy text matching to match school names. Here is an example of my data, which is two columns in a tibble: data <- tibble(school1 = c("abilene christian", "abilene christian", "abilene christian", "abilene christian"), …

r for-loop dplyr purrr stringdist

asked Feb 01 '18 at 16:57

Jenna Allen

1 answer

Displaying corresponding values in data frame in R

Please check the code below. I have created a data frame using three variables; the variable "y123" computes the similarity between columns a2 and a1. The variable "y123" gives me a total of 16 values, where every a1 value is compared with a2. My…

r dplyr stringdist record-linkage

asked Dec 07 '17 at 10:52

Ashmin Kaul

0 answers

User-defined match terms for string distance calculation in R

There are many string distance calculation methods available in the R package {stringdist} (https://cran.r-project.org/web/packages/stringdist/stringdist.pdf). I am curious whether it is possible to include user-defined match items by using regex…

r function string-matching stringdist

asked Sep 13 '17 at 22:16

Anne

1 answer

text mining with r library stringdist

I have the next algorithm prepared for matching two strings. library(stringdist) qgrams('perimetrico','perimetrico peri',q=2) pe ri tr er im me o et ic co p V1 1 2 1 1 1 1 0 1 1 1 0 V2 2 3 1 2 1 1 1 1 1 1 1 As far as Im…

r stringdist

asked Sep 07 '17 at 21:37

lolo

0 answers

Approximate String matching exclude first character

I'm trying to do approximate String matching between lists of terms terms1 and terms2 where I want to match Strings including typos, different notations, etc. I'm using amatch(terms1, terms2, method="osa", maxDist=1, nomatch=0) I want to match…

r string-matching stringdist

asked Aug 31 '17 at 09:01

Alec

1 answer

RecordLinkage - R one vector. Do not match to self

If I have one vector of names, say: a = c("tom", "tommy", "alex", "tom", "alexis", "Alex", "jenny", "Al", "michell") I want to use levenshteinSim or similar to get a similarity score within this vector. However, I don't want it to self score.…

r levenshtein-distance fuzzy-logic stringdist record-linkage

asked Aug 16 '17 at 15:22

Rtab

1 answer

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio

I have a data.table dt with 3 columns: id, name as string, threshold as num. A sample is: dt <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6)) nid | rname | maxr n1 |…

r data.table dplyr stringdist

asked Aug 08 '17 at 10:15

user2590177

0 answers

Calculating pairwise string distance for big data

I'm comparing pairwise string distances for 8 million observations on 17 columns. Because I run into memory issues, I want to ask for help on a sub-setting technique or other methods to overcome this issue. In a different question on this website,…

r string-comparison stringdist bigdata

asked Feb 22 '17 at 20:13

wake_wake

1 answer

In R - fastest way pairwise comparing character strings on similarity

I'm looking for a way to speed up the following approach. Any pointers are very welcome. Where are the bottlenecks? Say I have the following data.frame: df <- data.frame(names=c("A ADAM", "S BEAN", "A APPLE", "J BOND", "J BOND"), …

r string dataframe string-comparison stringdist

asked Feb 18 '17 at 22:24

wake_wake

2 answers

Maintaining headers in edit distance

I am running edit distance using stringdist. The output replaces the input with a numbered list instead of the actual string being compared. This is currently what I have: library(stringdist) a <- c("foo", "bar", "bear", "boat", method =…

r edit-distance stringdist

asked Dec 23 '16 at 18:53

El David

1 answer

Reshaping and summarizing a data.frame based on partial match text (package stringdist)

I am working on an old list of names. The names of people are written differently but in reality, these are the same people. I used the stringdist package to compute the distance between strings to find which names are probably the same. A small example of…

r dataframe dplyr reshape2 stringdist

asked Mar 17 '16 at 13:20

Wilcar

2 answers

R look for abbreviation in full string

I'm looking for an efficient way in R to tell if one string might be an abbreviation for another. The basic approach I'm taking is to see if the letters in the shorter string appear in the same order in the longer string. For example, if my shorter…

regex r string stringdist

asked Nov 02 '15 at 17:20

chtongueek

1 answer

More efficient method for populating a matrix than nested for loops

Is there a more efficient way to achieve the following? library(dplyr) filers <- sapply(1:100, function(z) sample(letters, sample(1:20, 1), replace=T) %>% paste(collapse='')) %>% unlist() %>% unname() n <- length(unique(filers)) similarityMatrix <-…

r performance matrix stringdist

asked Sep 13 '15 at 23:08

tblznbits

2 answers

How to create groups of like sounding names in R?

I'd like to create a group variable based upon how similar a selection of names is. I have started by using the stringdist package to generate a measure of distance. But I'm not sure how to use that output information to generate a group by…

r grouping fuzzy-comparison stringdist

asked Aug 27 '15 at 20:22

Kath05
