Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
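
As a rough illustration of the description above, here is a minimal sketch that calls one distance from each family, plus the approximate `match` and the Soundex encoder. It assumes the stringdist package is installed from CRAN; the example strings and the `cities` lookup table are made up for illustration.

```r
# Assumes the package is installed: install.packages("stringdist")
library(stringdist)

# Edit-based distance: Levenshtein distance between two strings.
lv <- stringdist("kitten", "sitting", method = "lv")
print(lv)  # 3

# q-gram based distance: Jaccard distance on character bigrams (q = 2).
jac <- stringdist("night", "nacht", method = "jaccard", q = 2)
print(jac)

# Heuristic metric: Jaro-Winkler distance (p is the prefix scaling factor).
jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)
print(jw)

# Approximate version of match(): position in the table of the closest
# entry within maxDist, or NA when nothing is close enough.
# The default method is optimal string alignment ("osa").
cities <- c("Amsterdam", "Rotterdam", "Utrecht")  # hypothetical lookup table
matches <- amatch(c("Amstrdam", "Utrect", "Paris"), cities, maxDist = 2)
print(matches)  # 1 3 NA

# Soundex codes for a character vector.
codes <- phonetic(c("Robert", "Rupert"), method = "soundex")
print(codes)
```

The `maxDist` cutoff is the main knob in `amatch`: too small and misspellings go unmatched, too large and unrelated strings start to match, so it is usually worth checking it against a sample of the real data.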

162 questions

1 answer

Fuzzy Matching with Strings containing numbers

I am trying approximate matches between the reference and the target strings. I have tried adist and stringdist in R with the various distances available. While the algorithms do a good job of matching strings containing only letters, they fail to match…

r fuzzy-logic fuzzy-comparison stringdist

asked Feb 04 '20 at 06:20

darkage

0 answers

How to use NLP / string manipulation to recode multiple columns of state/city/foreign locations

VERY appreciative of help!!! I have some very dirty data I am trying to clean up. Looking for an elegant solution in R that will correctly identify if there is foreign travel or not (TRUE = foreign travel, FALSE = domestic/USA travel). There are…

r nlp levenshtein-distance grepl stringdist

asked Sep 25 '19 at 11:54

Ellie

1 answer

Getting the closest string matches between two lists

I am a real beginner in R and I just have these two lists with names of cities in them. One list has user-generated names (people spell messily) and the other has the correct spelling of the names. I tried using the package stringdist, and I ended up…

r dataframe spell-checking stringdist

asked May 03 '19 at 20:33

Gabriel Rangel

2 answers

Merging two dataframes by stringmatch with dplyr and stringdist

r dplyr stringdist

asked May 02 '19 at 20:22

Christopher Penn

1 answer

Jaro-Winkler's difference between packages

I am using fuzzy matching to clean up medication data input by users, and I am using Jaro-Winkler's distance. I was testing which package with Jaro-Winkler's distance was faster when I noticed the default settings do not give identical values. Can…

r fuzzy-comparison stringdist record-linkage

asked Oct 08 '18 at 17:24

Andrew

3 answers

Quick way to count number of position match of a given character between all rows pairwise

I have a matrix and I want to count, for every pair of rows, the number of positions at which the same character appears. An example of the way I'm doing it is below, but my matrix has 10,000 rows and it's taking too long. # This code will…

r hamming-distance stringdist

asked Apr 11 '18 at 17:01

celacanto

0 answers

I'm trying to use the "stringdist" to fuzzy match company names between two data frames, but it's not working very good, what can be done?

I have a data frame with 5 million different company names, many of them refer to the same company spelled in different ways or with misspellings. I use a company name "Amminex" as an example here and then try to stringdist it to the 5 million…

r stringdist

asked Mar 31 '18 at 09:40

WoeIs

1 answer

Quick search in data.table or quick subset

I have a DF with 800k+ rows with repeated (random) values. For each row I need to take a value and find an index of a new row(s) with same value. E.g. "asd" - where else do I see it? The index of the current row is NOT needed. My current solution:…

r match stringdist

asked Mar 14 '18 at 01:30

Alexey Ferapontov

1 answer

R: Correct strings by distance measure (stringdistmatrix)

I am dealing with the problem that I need to count unique names of people in a string, but taking into consideration that there may be slight typos. My thought was to set strings below a certain threshold (e.g. levenshtein distance below 2) as…

r stringr stringdist

asked Dec 16 '17 at 19:30

moabit21

2 answers

R fuzzy string match to return specific column based on matched string

I have two large datasets, one with around half a million records and the other with around 70K. These datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, addresses can be…

r merge data.table string-matching stringdist

asked Mar 12 '17 at 15:45

user1412

1 answer

Remove rows containing identical or word-permuted sentences from a data frame in R

I have a data frame with a text column TERM containing the rows "good morning", "hello", "morning good", "you're welcome", "hello", "hi". I would like to filter out all duplicates and all rows with the same words in a different order, so that I get TERM: "good morning", "hello", "you're welcome", "hi". I…

r dataframe stringdist

asked Dec 20 '16 at 13:38

JoergP

1 answer

Why does R stringdist return Inf in q-gram distance with one string shorter than q?

I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q. So for these two strings, while the qgrams function…

r stringdist

asked Oct 19 '16 at 08:58

Giora Simchoni

1 answer

Compare item in one row against all other rows and loop through all rows using data.table - R

I'm combining similar names using stringdist(), and have it working using lapply, but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far…

r performance data.table stringdist

asked Apr 29 '16 at 20:23

Luke Macaulay

2 answers

String distance metrics that is in favor of substring, and word order independent?

For my data analytics problem, I usually need to normalize names: given names A and B, I'd consider them the same or very similar if A and B share a substantial number of common substrings, regardless of the order of those substrings. For example,…

r string edit-distance stringdist

asked Mar 14 '15 at 09:33

Yu Shen

1 answer

Finding similar rows (not duplicates) in a dataframe in R

I have a dataset of >800k rows (example): id fieldA fieldB codeA codeB 120 Similar one addrs example1 929292 0006 3490 Similar oh addrs example3 929292 0006 2012 CLOSE CAA addrs example10232 kkda9a …

r duplicates stringdist

asked Feb 11 '15 at 16:59

Rwak
