Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
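
As a rough illustration of the description above, here is a minimal sketch in R. It assumes the stringdist package is installed and uses its `stringdist()`, `amatch()` and `phonetic()` functions; the example strings and the `maxDist`, `q` and `p` values are arbitrary choices for demonstration, not recommendations.

```r
# Assumption: the package is installed, e.g. via install.packages("stringdist")
library(stringdist)

# Edit-based distances: "aplpes" is "apples" with two adjacent letters swapped.
d_osa <- stringdist("apples", "aplpes", method = "osa")  # swap counts as one edit
d_lv  <- stringdist("apples", "aplpes", method = "lv")   # Levenshtein has no swap operation

# q-gram based distance, here on character bigrams (q = 2).
d_qgram <- stringdist("applesauce", "apple-sauce", method = "qgram", q = 2)

# Heuristic metric: Jaro-Winkler distance (p is the prefix weight).
d_jw <- stringdist("applesauce", "apple-sauce", method = "jw", p = 0.1)

# Approximate version of match(): index of the closest table entry
# within maxDist, or NA if nothing is close enough.
idx <- amatch("aple", c("pear", "apple"), maxDist = 1)

# Soundex code of a string.
code <- phonetic("Robert")

print(c(osa = d_osa, lv = d_lv, qgram = d_qgram, jw = d_jw))
print(idx)
print(code)
```

With these inputs the optimal string alignment distance is 1 and the Levenshtein distance is 2, because only the former treats a transposition of adjacent characters as a single edit.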

162 questions
0
votes
2 answers

count the transpositions needed to a string so that it can be found in another string

Here is what I am trying to do: When the term I am analyzing is "apples", I would like to know how many transpositions are needed to "apples" so that it can be found in a string. "buy apples now" => 0 transposition needed (apples is present). "cheap…
Julien Massardier
0
votes
1 answer

Is there a way to check whether two strings are approximately the same?

Consider the following two strings: applesauce and apple-sauce . These are referring to the same object. Thus any record containing these two names would be considered duplicates. However, in R, these are considered as separate levels. Could one…
NebulousReveal
0
votes
1 answer

Successively agrep names in a variable, then create a new variable with the shortest name for close matches

Assume a character vector of company names where the names come in various forms. Here is a small version of a 10,000-row data frame; it shows the desired second vector ("two.names"). structure(list(firm = structure(1:8, .Label = c("Carlson Caspers",…
lawyeR
0
votes
1 answer

R Relevant match between 2 huge data sets. Even with Spelling Mistakes

I have input "I am travelling on my own, I have just brought a world ticket to go to singapore, darwin, perth, adelaide, melbourne, brisbane, gold cost, sydney Opra, christchurch,gold coast Richland, Aukland,Austrlia, and fji. It is a 10 month…
user3619015
0
votes
3 answers

efficient programming in R

I have a data like author_id paper_id confirmed author_name1 author_affiliation1 author_name 826 25733 1 Emanuele Buratti Genetic engineering Emanuele Buratti 826 25733 1 Emanuele Buratti …
user3171906
-1
votes
1 answer

Replace duplicates in matrix

I have the following test code for you: ####TESTING HERE test = tibble::tribble( ~Name1, ~Name2, ~Name3, "Paul Walker", "Paule Walkr", "Heiko Knaup", "Ferdinand…
Max H.
-1
votes
4 answers

Standardize the City Name in R

I am new to R and the coding world, so pardon me if I have misused some jargon here (correct me if I'm wrong). I am facing a challenge cleaning city names in a dataframe. I tried to use GetCloseMatches, strdist_inner_join (with fuzzywuzzy, I believe) with dplyr style…
rgoei
-1
votes
1 answer

Computing edit distance using two simple columns from iris dataset

In the code below, I want to compute similarity between two columns of text strings. To achieve this, I take the first 10 rows of the "Petal.Length" column from iris and assign them to a1, and the first 4 rows from the "Sepal.Length" column from iris and…
Ashmin Kaul
-1
votes
1 answer

merging data.frame rows based on similar strings in r

I have one data.frame with multiple columns. The first column contains company names. These have been entered by users and many values contain similar strings representing the same entity. For example Company A Pty. Company A Pty. Ltd. Company A…
-2
votes
1 answer

How to identify occurrences of similar addresses in character vector of R

I have a dataframe containing one address column. Same addresses are incorrectly spelled and counted as unique. I want to identify and calculate frequencies of similar addresses. I need new dataframe with following columns: Address and number of…
-2
votes
1 answer

How to cluster the similar texts in R

I know a similar question might have been asked in this or a different forum, but I feel my requirement is different. I have a 2-column dataframe as shown below: Verbatim LowestlevelTerm Acute Bronchitis Acute Bronchitis Sinusitis Maxillaris…
-4
votes
2 answers

How to programmatically find variations of a specific word in a sentence?

Sometimes the data you get is not clean and contains variations of the words used, misspelled or manipulated. Can we find such instances of the closest resemblance of a word in a sentence? For instance, if I am looking for the word "Awesome" which has…
Mindfreak