Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
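To make the edit-based and q-gram-based families concrete, here is a minimal sketch of one distance from each, plus a toy approximate lookup in the spirit of an approximate 'match'. It is written in Python purely as a language-neutral illustration; it is not the package's R code or API, and the names `levenshtein`, `qgram_distance` and `approx_match` are hypothetical helpers invented for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the q-gram count vectors.

    Assumption: a string shorter than q simply contributes no q-grams here.
    A real library may treat that case differently.
    """
    ca = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    cb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    return sum(abs(ca[g] - cb[g]) for g in ca.keys() | cb.keys())


def approx_match(x: str, table: list, max_dist: int = 1):
    """Index of the closest table entry within max_dist, else None."""
    if not table:
        return None
    best = min(range(len(table)), key=lambda i: levenshtein(x, table[i]))
    return best if levenshtein(x, table[best]) <= max_dist else None


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(qgram_distance("leia", "leela", q=2))
    print(approx_match("Amsterdm", ["Rotterdam", "Amsterdam", "Utrecht"]))
```

The lookup returns a position rather than the matched string, mirroring how an exact 'match' reports positions; the threshold `max_dist` is an illustrative choice, not a recommended default.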

162 questions
2 votes, 1 answer

Fuzzy Matching with Strings containing numbers

I am trying to find approximate matches between reference and target strings. I have tried adist and stringdist in R with the various distances available. While the algorithms do a good job of matching strings that contain only letters, they fail to match…
darkage
2 votes, 0 answers

How to use NLP / string manipulation to recode multiple columns of state/city/foreign locations

VERY appreciative of help!!! I have some very dirty data I am trying to clean up. Looking for an elegant solution in R that will correctly identify if there is foreign travel or not (TRUE = foreign travel, FALSE = domestic/USA travel). There are…
Ellie
2 votes, 1 answer

Getting the closest string matches between two lists

I am a real beginner in R and I have two lists of city names. One list has user-generated names (with messy spelling) and the other has the correctly spelled names. I tried using the package stringdist, and I ended up…
2 votes, 2 answers

Merging two dataframes by stringmatch with dplyr and stringdist

I'm attempting to do a dplyr left join on two dataframes based on greatly similar language (that's not exact). DF1: title | records Bob's show, part 1 | 42 Time for dinner | 77 Horsecrap | 121 DF2: showname | counts Bob's show part 1 | 772 Dinner…
Christopher Penn
2 votes, 1 answer

Jaro-Winkler's difference between packages

I am using fuzzy matching to clean up medication data input by users, and I am using Jaro-Winkler's distance. I was testing which package with Jaro-Winkler's distance was faster when I noticed the default settings do not give identical values. Can…
Andrew
2 votes, 3 answers

Quick way to count number of position match of a given character between all rows pairwise

I have a matrix and I want to count how often each character appears in the same position across all pairs of rows. An example of my current approach is below, but my matrix has 10,000 rows and it is taking too long. # This code will…
celacanto
2 votes, 0 answers

I'm trying to use the "stringdist" to fuzzy match company names between two data frames, but it's not working very good, what can be done?

I have a data frame with 5 million different company names, many of them refer to the same company spelled in different ways or with misspellings. I use a company name "Amminex" as an example here and then try to stringdist it to the 5 million…
WoeIs
2 votes, 1 answer

Quick search in data.table or quick subset

I have a DF with 800k+ rows with repeated (random) values. For each row I need to take a value and find an index of a new row(s) with same value. E.g. "asd" - where else do I see it? The index of the current row is NOT needed. My current solution:…
Alexey Ferapontov
2 votes, 1 answer

R: Correct strings by distance measure (stringdistmatrix)

I am dealing with the problem that I need to count unique names of people in a string, but taking into consideration that there may be slight typos. My thought was to set strings below a certain threshold (e.g. levenshtein distance below 2) as…
moabit21
2 votes, 2 answers

R fuzzy string match to return specific column based on matched string

I have two large datasets, one with around half a million records and the other with around 70K. Both datasets contain addresses. I want to check whether any of the addresses in the smaller dataset are present in the larger one. As you would imagine, addresses can be…
user1412
2 votes, 1 answer

Remove rows containing identical or word-permuted sentences from a data frame in R

I have a data frame with text TERM good morning hello morning good you're welcome hello hi I would like to filter out all duplicates and all with the same words but in different order. So that I get: TERM good morning hello you're welcome hi I…
JoergP
2 votes, 1 answer

Why does R stringdist return Inf in q-gram distance with one string shorter than q?

I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q. So for these two strings, while the qgrams function…
Giora Simchoni
2 votes, 1 answer

Compare item in one row against all other rows and loop through all rows using data.table - R

I'm combining similar names using stringdist(), and have it working using lapply, but it's taking 11 hours to run through 500k rows and I'd like to see if a data.table solution would work faster. Here's an example and my attempted solution so far…
Luke Macaulay
2 votes, 2 answers

String distance metrics that is in favor of substring, and word order independent?

For my data analytics problem, I usually need to regularize names: I'd consider names A and B the same or very similar if they share a substantial number of common substrings, regardless of the order of those substrings. For example,…
Yu Shen
2 votes, 1 answer

Finding similar rows (not duplicates) in a dataframe in R

I have a dataset of >800k rows (example): id fieldA fieldB codeA codeB 120 Similar one addrs example1 929292 0006 3490 Similar oh addrs example3 929292 0006 2012 CLOSE CAA addrs example10232 kkda9a …
Rwak