Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
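As a rough illustration of the functions this description refers to, here is a minimal sketch. It assumes the stringdist package is installed; the example strings and variable names are made up for illustration and are not part of the tag description.

```r
library(stringdist)

# Edit-based distances between two strings.
d_osa <- stringdist("ca", "abc", method = "osa")  # optimal string alignment (the default method)
d_dl  <- stringdist("ca", "abc", method = "dl")   # full Damerau-Levenshtein

# q-gram based and heuristic distances (values lie between 0 and 1).
d_jaccard <- stringdist("leia", "leela", method = "jaccard", q = 2)
d_jw      <- stringdist("leia", "leela", method = "jw", p = 0.1)

# Approximate version of match(): position of the closest table entry
# within maxDist, or NA if nothing is close enough.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)

# Soundex code of a string.
code <- phonetic("Robert", method = "soundex")
```

The two edit distances differ here because optimal string alignment does not allow a substring to be edited more than once, whereas the full Damerau-Levenshtein distance does.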

162 questions
0 votes, 1 answer

Clustering similar strings in a big dataset

My data is similar to the following one comp_name perm_id GM Global Technologies Operations LLC 16002 GM Global Technologies Operations, Inc. …
Enes
  • 31
  • 3
0 votes, 2 answers

Joining on inexact strings in R

I am looking to join two tables; however, the data I am looking to join on does not match exactly. I am joining on NFL player names; the data sets are below. > dput(att75a) structure(list(rusher_player_name = c("A.Ekeler", "A.Jones", "A.Kamara",…
0 votes, 0 answers

Distance/Fuzzy matching 2 columns with another 2 columns in R

In my simplified example I have a dataframe with four different columns. I want to be able to match main_name and main_dob together with secondary_name and secondary_dob. The actual order of the rows doesn't matter, so if there is a match in row 3…
Joey
  • 1
0 votes, 0 answers

How to calculate the similarity between 2 String columns using TF IDF in R

A similar question may have been asked in this forum, but I feel my requirement is peculiar. I have a data frame df1 that consists of the variable "WrittenTerms" with 40,000 observations, and I have another data frame df2 with variable…
0 votes, 1 answer

Calculating similarity between two vectors/Strings in R

A similar question may have been asked in this forum, but I feel my requirement is peculiar. I have a data frame df1 that consists of the variable "WrittenTerms" with 40,000 observations, and I have another data frame df2 with the variable "SuggestedTerms" with…
0 votes, 0 answers

Appending data with columns that have similar names using pattern matching

I have five Excel files with 2 sheets per file. -file_2015: a, b -file_2016: a, b -file_2017: a, b -file_2018: a, b Both sheets a and b provide the same data over time. They vary in the number of columns because new indicators are added to newer…
brin
  • 35
  • 1
  • 6
0 votes, 0 answers

Is there a way to speed up the R package stringdist significantly, e.g. by using Rcpp?

I have a loop, in which I have to calculate a distance between one string and a vector of many strings. I use the package "stringdist" and the function of the same name, which works well. However, it takes quite some time to calculate the distances…
Nicholas
  • 93
  • 1
  • 9
0 votes, 1 answer

Recycling error while using stringdist and data.table in R

I am trying to perform approximate string matching for a data.table containing author names, based on a dictionary of "first" names. I have also set a high threshold, say above 0.9, to improve the quality of matching. However, I get an error message…
ds_newbie
  • 79
  • 8
0 votes, 2 answers

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches. Data looks like: dataset1: California Texas Florida New York dataset2: Californiia callifoornia T3xas Te xas texas Fl0 rida folrida New york new…
Tim
  • 11
  • 1
0 votes, 1 answer

R - return n matches via Levenshtein distance

I would like to find the n best matches to a given string via Levenshtein distance. I know that the adist function in R gives the minimal distance, but I am attempting to scale the number of results to, say, 10. I have some code below. name <-…
jvalenti
  • 604
  • 1
  • 9
  • 31
0 votes, 1 answer

Using stringdist for two datasets with a cross join in R

stringdist works with vectors: stringdist("ca","abc") [1] 3, but I want to match two datasets. First: structure(list(id = structure(c(5L, 2L, 4L, 3L, 6L, 1L, 7L), .Label = c("SOFT Ватные палочки 100 ПЭ (БЭЛЛ", "Лимоны 55+", "МАКФА макароныоны перья любит.…
d-max
  • 167
  • 13
0 votes, 0 answers

Fuzzy matching without 'master table'

Is it possible to perform some type of fuzzy matching without having a table of desired results? For example, standardising these rows: Lord Philip Harris Lord Harris of Peckham Lord Philip C. Harris Philip Lord C Harris Lord Phillip Harris of…
deethreenovice
  • 127
  • 1
  • 2
  • 17
0 votes, 1 answer

Subset dataframe where two digits are interchanged in R

I have the dataframe below: df <- read.table(text = "code Num mail identifier U_id YY-12 12345 jjf@gmail.com ar145j U-111 YY-13 12345 jjf@gmail.com Ra145J U-111 YY-14 …
Jupiter
  • 221
  • 1
  • 12
0 votes, 1 answer

String fuzzy matching in dataframe

I have a dataframe containing the titles of articles and the associated URL links. My problem is that the URL link is not necessarily in the row of the corresponding title, for example: title | urls …
ML_Enthousiast
  • 1,147
  • 1
  • 15
  • 39
0 votes, 0 answers

Efficient way to calculate cosine similarity while avoiding a for loop

I am trying to calculate cosine similarity using the stringdist function from the stringdist package in R. I want to get the average cosine similarity for each row in scoring_dt by calculating cosine similarity with each row of baseline_dt and taking the mean for…
Rushabh Patel
  • 2,672
  • 13
  • 34