Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits, qgrams or heuristic metrics. An implementation of soundex is provided as well.

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
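
As a rough illustration of the description above, here is a minimal sketch of the main entry points. It assumes the stringdist package is installed from CRAN; the example strings and variable names are made up for illustration and do not come from any question on this page.

```r
# Minimal sketch; assumes install.packages("stringdist") has already been run.
library(stringdist)

# Edit-based distances between two strings
d_osa <- stringdist("ca", "abc", method = "osa")  # optimal string alignment (the default)
d_dl  <- stringdist("ca", "abc", method = "dl")   # full Damerau-Levenshtein
d_lv  <- stringdist("ca", "abc", method = "lv")   # Levenshtein

# q-gram based and heuristic distances
d_cos <- stringdist("stringdist", "stringdits", method = "cosine", q = 2)
d_jw  <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate version of match(): index of the closest entry in the
# lookup table whose distance does not exceed maxDist (NA if none)
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)

# Soundex code of a name
code <- phonetic("Robert")

# Distance between integer vectors treated as generic sequences
d_seq <- seq_dist(c(1L, 2L, 3L), c(1L, 3L, 2L), method = "lv")
```

Which method fits depends on the data: edit-based distances suit typos, while q-gram and Jaro-Winkler distances are often tried for names and longer strings.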

162 questions

votes

1 answer

Clustering similar strings in a big dataset

My data is similar to the following one comp_name perm_id GM Global Technologies Operations LLC 16002 GM Global Technologies Operations, Inc. …

r stringdist

asked Apr 03 '20 at 10:39

Enes

votes

2 answers

joining on inexact strings in R

I am looking to join two tables. However, the data I am looking to join on does not match exactly. I am joining on NFL player names. Data sets below: > dput(att75a) structure(list(rusher_player_name = c("A.Ekeler", "A.Jones", "A.Kamara",…

r string join data-cleaning stringdist

asked Feb 20 '20 at 23:14

sbarbarotta

votes

0 answers

Distance/Fuzzy matching 2 columns with another 2 columns in R

In my simplified example I have a dataframe with four different columns. I want to be able to match main_name and main_dob together with secondary_name and secondary_dob. The actual order of the rows doesn't matter, so if there is a match in row 3…

r matching fuzzy-logic stringdist jaro-winkler

asked Jan 14 '20 at 20:36

Joey

votes

0 answers

How to calculate the similarity between 2 String columns using TF IDF in R

A similar question may already have been asked in this forum, but I feel my requirement is peculiar. I have a data frame df1 that contains the variable "WrittenTerms" with 40,000 observations, and I have another data frame df2 with variable…

r pattern-matching similarity tf-idf stringdist

asked Oct 24 '19 at 05:04

Pavan kumar

votes

1 answer

Calculating similarity between two vectors/Strings in R

A similar question may already have been asked in this forum, but I feel my requirement is peculiar. I have a data frame df1 that contains the variable "WrittenTerms" with 40,000 observations, and I have another data frame df2 with variable "SuggestedTerms" with…

r pattern-matching similarity cosine-similarity stringdist

asked Oct 21 '19 at 12:04

Pavan kumar

votes

0 answers

Appending data with columns that have similar names using pattern matching

I have five Excel files with 2 sheets per file. -file_2015: a, b -file_2016: a, b -file_2017: a, b -file_2018: a, b Both sheets a and b provide the same data over time. They vary in the number of columns because new indicators are added to newer…

r dplyr tidyverse purrr stringdist

asked Aug 30 '19 at 14:51

brin

votes

0 answers

Is there a way to speed up the R package stringdist significantly, e.g. by using Rcpp?

I have a loop, in which I have to calculate a distance between one string and a vector of many strings. I use the package "stringdist" and the function of the same name, which works well. However, it takes quite some time to calculate the distances…

r rcpp stringdist

asked Jun 12 '19 at 13:25

Nicholas

votes

1 answer

Recycling error while using stringdist and data.table in R

I am trying to perform approximate string matching on a data.table containing author names, based on a dictionary of "first" names. I have also set a high threshold, say above 0.9, to improve the quality of matching. However, I get an error message…

r data.table stringdist

asked Jun 12 '19 at 13:11

ds_newbie

votes

2 answers

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches. Data looks like: dataset1: California Texas Florida New York dataset2: Californiia callifoornia T3xas Te xas texas Fl0 rida folrida New york new…

r string join stringdist

asked Apr 22 '19 at 22:59

Tim

votes

1 answer

R - return n matches via levenshtein distance

I would like to find the n best matches to a given string via levenshtein distance. I know that the adist function in R gives the minimal distance, but I am attempting to scale the number of results to, say, 10. I have some code below. name <-…

r dataframe tm levenshtein-distance stringdist

asked Nov 02 '18 at 21:21

jvalenti

votes

1 answer

using stringdist for two dataset with crossjoin in R

stringdist works with vectors: stringdist("ca","abc") [1] 3, but I want to match two datasets. First: structure(list(id = structure(c(5L, 2L, 4L, 3L, 6L, 1L, 7L), .Label = c("SOFT Ватные палочки 100 ПЭ (БЭЛЛ", "Лимоны 55+", "МАКФА макароныоны перья любит.…

r string stringdist

asked Oct 13 '18 at 10:59

d-max

votes

0 answers

Fuzzy matching without 'master table'

Is it possible to perform some type of fuzzy matching without having a table of desired results? For example, standardising these rows: Lord Philip Harris Lord Harris of Peckham Lord Philip C. Harris Philip Lord C Harris Lord Phillip Harris of…

r fuzzy-logic stringdist

asked Sep 04 '18 at 16:35

deethreenovice

votes

1 answer

subset dataframe where two digits are interchanged in R

I have the dataframe shown below: df <- read.table(text = "code Num mail identifier U_id YY-12 12345 jjf@gmail.com ar145j U-111 YY-13 12345 jjf@gmail.com Ra145J U-111 YY-14 …

r stringdist

asked Jul 23 '18 at 14:44

Jupiter

votes

1 answer

String fuzzy matching in dataframe

I have a dataframe containing article titles and their associated URLs. My problem is that a URL is not necessarily in the same row as its corresponding title. Example: title | urls …

r fuzzy-logic stringdist record-linkage

asked Apr 08 '18 at 06:33

ML_Enthousiast

votes

0 answers

Efficient way to calculate cosine similarity by ignoring for loop

I am trying to calculate cosine similarity using the stringdist function from the stringdist package in R. I want to get the average cosine similarity for each row in scoring_dt by calculating cosine similarity with each row of baseline_dt and taking the mean for…

r parallel-processing data.table lapply stringdist

asked Jan 26 '18 at 21:04

Rushabh Patel
