Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences.
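As a rough illustration of the functions this description refers to, here is a minimal sketch. It assumes the stringdist package is installed from CRAN; the example strings and variable names are made up for illustration and are not taken from any question below.

```r
library(stringdist)  # assumption: install.packages("stringdist") has been run

# Edit-based distances between two strings
d_lv  <- stringdist("kitten", "sitting", method = "lv")        # Levenshtein
d_osa <- stringdist("abc", "acb", method = "osa")              # optimal string alignment
d_ham <- stringdist("karolin", "kathrin", method = "hamming")  # equal-length strings only

# q-gram based and heuristic distances
d_cos <- stringdist("billing", "billed", method = "cosine", q = 2)
d_jw  <- stringdist("martha", "marhta", method = "jw", p = 0.1)  # Jaro-Winkler

# Pairwise distance matrix between two character vectors
words <- c("bill", "billing", "billed", "bills")
m <- stringdistmatrix(words, words, method = "lv")

# Approximate version of match(): index of the closest table entry within maxDist
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)

# Soundex code of a string
sx <- phonetic("Robert")
```

With these inputs, d_lv is 3 and idx is 2 (the position of "leela" in the lookup table). Which method and maxDist are appropriate depends on the data, so the values above should be read as placeholders rather than recommendations.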

162 questions

vote

2 answers

R String similarity matrix

I am busy with a text analytic project on masses of complaints data. One of the issues with the data is that you get multiple synonyms of the same word, e.g. bill, billing, billed, bills etc. Normally I would create a word frequency list and…

r tm synonym stringdist

asked Dec 11 '14 at 03:20

RUser

vote

1 answer

Improving performance of script for (Levenshtein distance with weights) in R

I am doing a large amount of string comparisons using the Levenshtein distance measure, but because I need to be able to account for the spatial adjacency in the latent structure of the strings, I had to make my own script including a weight…

r performance levenshtein-distance stringdist

asked May 09 '14 at 09:12

Martin Petri Bagger

vote

1 answer

machine learning algorithm for spelling check

I have a list of medicine names (regular_list) and a list of new names (new_list). I want to check whether the names in new_list are already present in regular_list. The issue is that the names in new_list could have some typos and I…

text machine-learning stringdist

asked Aug 22 '13 at 07:59

rohit

vote

1 answer

Write out results of for-loop of distance measures in matrix form in R

Suppose I have something like the following vector: text <- as.character(c("string1", "str2ing", "3string", "stringFOUR", "5tring", "string6", "s7ring", "string8", "string9", "string10")) I want to execute a loop that does pair-wise comparisons of…

r for-loop distance string-matching stringdist

asked Aug 05 '13 at 22:53

DV Hughes

votes

0 answers

Strange output of the `adist` function in R (string distance)

Why is output not equal to a 1 * 1 matrix here? EDIT : the strange behaviour comes from the diag function dist <- adist("errors", "eror", costs = c(1, 1, 1), counts = T) dist [,1] [1,] 2 attr(,"counts") , , ins [,1] [1,] 0 , ,…

r string stringdist damerau-levenshtein

asked Mar 07 '23 at 08:26

Julien

votes

2 answers

Joining dataframes on text strings using fuzzy string matching (stringdist_join())

I'm trying to join two datasets based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a…

r stringdist fuzzyjoin

asked Mar 06 '23 at 22:29

JRock

votes

0 answers

R Exact and Fuzzy joins using multiple columns

I am working on data that contains a list of treatment providers in one file and rural-urban codes in another file. The ultimate goal is to link the rural-urban codes to the treatment centers' locations via county and state locations. I have tried…

r join stringdist

asked Mar 01 '23 at 20:50

aarsmith

votes

1 answer

Using stringdist in R with big dataset (1.8 millions rows)?

I'm working with a dataset (df) which contains a column called job, where people just enter their job position. The problem is that the data is typed manually, so it contains a lot of misspellings. To do some calculations grouping by job, I'm…

r classification stringdist misspelling

asked Dec 11 '22 at 16:19

Tung Anh

votes

1 answer

Compare one column in database x to another column in database y and return a database z containing highly likely matches

I want to take a list of customer names and compare them to an internal database to find a highly likely match and return a customer code. So I would receive a list of customers like this: Cx Name Chicken C. Water Gmbh Computer ldt Food,…

r database matching stringdist

asked Sep 16 '22 at 12:53

NomNonYon

votes

1 answer

how to create loop for multiple output vectors with grabl function in stringdist

I'm trying to apply the grabl function of stringdist to a large character vector "testref". I want to check whether the strings in another character vector "testtitle" can be found in "testref". However, grabl only allows for a single string…

r loops stringdist

asked Aug 11 '22 at 13:12

Jonas

votes

0 answers

TERR - Spotfire Custom Column Expression not working using stringdist

This works fine in R Studio. library(stringdist) result <- afind(input1, input2, method="cosine") distance <- result[2] real_distance <- distance[[1]][1] output <- real_distance When I add it as a column expression in Spotfire, where input1 is a…

r spotfire stringdist terr

asked Jul 30 '22 at 17:23

smackenzie

votes

1 answer

R - stringdist aFind() method, no maxDist parameter

The documentation for aFind specifies a maxDist parameter you can use, but there is no maxDist parameter you can pass into aFind? https://cran.r-project.org/web/packages/stringdist/stringdist.pdf using this code: result = afind(ae_target_term,…

r stringdist

asked Jul 30 '22 at 15:39

smackenzie

votes

0 answers

Cosine similarity between rows of two large dataframes in R

I have two dataframes, DF1 and DF2. One of them is very large. I've created example versions of DF1 and DF2 like this: library(tidyverse) A<-rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000000) DF1<-data.frame(A) B<-rep(c('Rockets', 'Pacers',…

r levenshtein-distance cosine-similarity stringdist

asked May 06 '22 at 17:29

Elian Bourdin

votes

1 answer

Match two columns based on string distance in R

I have two very large dataframes containing names of people. The two dataframes report different information on these people (i.e. df1 reports data on health status and df2 on socio-economic status). A subset of people appears in both dataframes.…

r string matching string-matching stringdist

asked Mar 01 '22 at 11:11

srocco

votes

0 answers

Finding the pairwise distance of thousands of strings

My dataset looks like this but I have around 5*10^5 sequences/strings to compare pairwise and calculate their Levenshtein distance. I understand that this would lead to a matrix of 5*10^5 * 5*10^5 elements. So far I have tried the following packages…

r string tidyverse parallel.foreach stringdist

asked Feb 22 '22 at 20:58

LDT
