Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance), or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, while taking proper care of encoding, or between integer vectors representing generic sequences.

162 questions
0
votes
1 answer

Application of Text Mining in R on large dataset

The R script below computes the percentage similarity between two strings of text in columns "names1" and "names2". However, my requirement is to perform the same operation on 6K-10K+ column items. When the formula below is applied to such a big…
Adam Shaw
  • 519
  • 9
  • 24
0
votes
1 answer

Computing similarity % in text strings by excluding the identical entries in R

The given R script computes the similarity in % between two names as shown in the visual. Here we have two columns, "names1" and "names2", with their respective IDs in id1 and id2. My requirement is that when we execute the script, each name in…
Adam Shaw
  • 519
  • 9
  • 24
0
votes
0 answers

R string-based matching of business names

TL;DR I'd like to match two unequal columns where the values contain business names, and I've tried stringdist's amatch using Jaro-Winkler matching to get close, but not nearly close enough. I am wondering if stringi would be useful here - I just…
0
votes
1 answer

'R' Search for values in matrix

I have printed out a matrix with stringdistmatrix(c(). Works well, but now I need R to show me all cases with a value <=3. How can I search for those values in the matrix? Thanks in advance!
Fox
  • 41
  • 4
0
votes
1 answer

R String match for address using stringdist, stringdistmatrix

I have two large datasets, one with around half a million records and the other with around 70K. These datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, an address can be…
user1412
  • 709
  • 1
  • 8
  • 25
0
votes
0 answers

Function to find Max value using stringdist R

I'm trying to write a function that finds the Max value for each row of a column using stringdist. I have a line of code like this to find the max value of the 1st row but need help getting every…
jjm25
  • 1
  • 2
0
votes
0 answers

adist in R to match fuzzy strings

I have two Excel sheets with insurance claims data from two different insurance providers. I need to find cases of individuals who have filed claims under both providers. I would like to have something that pairs names if it seems likely that…
Amie
  • 103
  • 12
0
votes
0 answers

stringdist performance on windows vs. linux (red hat)

I recently developed a fuzzy-string-matching routine on a Windows box in R. I was really pleased by the speed. Now I am trying to run the same procedure on a virtual Red Hat server and it is much slower, i.e. by a factor of approx. 100. The whole procedure…
exilsaxo
  • 1
  • 1
0
votes
1 answer

Jaccard distance between tweets

I'm currently trying to measure the Jaccard distance between tweets in a dataset. This is where the dataset is: http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json. I've tried a few things to measure the distance. This is what I have…
user3577397
  • 453
  • 3
  • 12
  • 27
0
votes
0 answers

How to fuzzy match text in a column and then replace with a consensus in R

I have a dataframe as follows: FName LName Ayeko Seki Ayeko Seki Ayeko Seki Ayeko Zeki Aveko Seki Avoo Zooki Jacques Bergmann. Jacques Burgman J Bergman Jacques Bergmann Jacques Bergmann Jacques Bergmann Jacques Bergmann David …
Sebastian Zeki
  • 6,690
  • 11
  • 60
  • 125
0
votes
3 answers

String distance matrix by criteria

I have written a script to do some fuzzy matching of company names. I'm matching a number of not-always-completely-correct company names (i.e. there might be small spelling mistakes or the "inc." suffix is missing) up against a corpus of "correct"…
Morten Nielsen
  • 325
  • 2
  • 4
  • 19
0
votes
2 answers

Using stringdist in R

Let's say I have the following words: word1 = 'john lennon' word2 = 'john lenon' word3 = 'lennon john'. It's almost clear that these 3 words are referring to the same person. Having the following code: library(stringdist) >stringdist('john…
Mpizos Dimitris
  • 4,819
  • 12
  • 58
  • 100
0
votes
1 answer

Using stringsim in stringdist

I'm using the package stringdist to compare some vectors of strings but I keep getting a different answer than what I think I should when I try to test out the package. I want to do this: stringsim('PANDIAN', 'PANIAN', method="lv") [1]…
grad_student
  • 317
  • 1
  • 5
  • 13
0
votes
1 answer

R - stringdist cost setting error

I get an error when I try to set the operation costs in stringdist. Any ideas why? library(stringdist) seq = rbind( c('aaa'), c('aba'), c('aab'), c('ccc') ) This works perfectly (Levenshtein distance): stringdistmatrix(a = seq, b…
giac
  • 4,261
  • 5
  • 30
  • 59
0
votes
0 answers

String matching of variables with white spaces using stringdist package

I am trying to match the strings in a dataset using Jaro distance. The problem is that I am getting strings with white spaces as matches. Here is the data: df1 <- data.frame(ID1=c("london.inc","USA","UK","ball"," "),ID2=c("london.in","US","UKS","bull","…
user3570187
  • 1,743
  • 3
  • 17
  • 34