Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits, qgrams or heuristic metrics. An implementation of soundex is provided as well.

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), qgrams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
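As a rough illustration of the features listed above, the following minimal sketch calls a few of the package's functions (`stringdist`, `stringdistmatrix`, `amatch`, `phonetic`). It assumes the stringdist package is installed from CRAN; the example strings are made up for illustration and do not come from any of the questions below.

```r
# Minimal sketch, assuming the stringdist package is installed:
# install.packages("stringdist")
library(stringdist)

# Edit-based distances between two strings
d_lv  <- stringdist("kitten", "sitting", method = "lv")        # Levenshtein
d_ham <- stringdist("karolin", "kathrin", method = "hamming")  # Hamming (equal lengths)

# q-gram based and heuristic distances; q and p are tuning parameters
d_cos <- stringdist("billing", "billed", method = "cosine", q = 2)
d_jw  <- stringdist("billing", "billed", method = "jw", p = 0.1)

# Pairwise distances within one character vector (returns a 'dist' object)
words <- c("bill", "billing", "billed", "bills")
dmat  <- stringdistmatrix(words, method = "lv")

# Approximate version of match(): index of the closest table entry
# within maxDist, or NA if nothing is close enough
idx <- amatch("leia", c("uhura", "leela"), maxDist = 5)

# Soundex code of a name
code <- phonetic("Robert", method = "soundex")
```

The `maxDist` argument controls how tolerant `amatch` is: with a small threshold the same call returns `NA` instead of an index, which is usually the first thing to tune when matching noisy names against a reference list.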

162 questions
1
vote
2 answers

R String similarity matrix

I am busy with a text analytic project on masses of complaints data. One of the issues with the data is that you get multiple synonyms of the same word, e.g. bill, billing, billed, bills etc. Normally I would create a word frequency list and…
RUser
  • 588
  • 1
  • 4
  • 17
1
vote
1 answer

Improving performance of script for (Levenshtein distance with weights) in R

I am doing a large amount of string comparisons using the Levenshtein distance measure, but because I need to be able to account for the spatial adjacency in the latent structure of the strings, I had to make my own script including a weight…
Martin Petri Bagger
  • 2,187
  • 4
  • 17
  • 20
1
vote
1 answer

machine learning algorithm for spelling check

I have a list of medicine names (regular_list) and a list of new names (new_list). I want to check whether the names in new_list are already present in regular_list. The issue is that the names in new_list could have some typos and I…
rohit
  • 47
  • 2
  • 10
1
vote
1 answer

Write out results of for-loop of distance measures in matrix form in R

Suppose I have something like the following vector: text <- as.character(c("string1", "str2ing", "3string", "stringFOUR", "5tring", "string6", "s7ring", "string8", "string9", "string10")) I want to execute a loop that does pair-wise comparisons of…
DV Hughes
  • 305
  • 2
  • 5
  • 22
0
votes
0 answers

Strange output of the `adist` function in R (string distance)

Why is output not equal to a 1 * 1 matrix here? EDIT : the strange behaviour comes from the diag function dist <- adist("errors", "eror", costs = c(1, 1, 1), counts = T) dist [,1] [1,] 2 attr(,"counts") , , ins [,1] [1,] 0 , ,…
Julien
  • 1,613
  • 1
  • 10
  • 26
0
votes
2 answers

Joining dataframes on text strings using fuzzy string matching (stringdist_join())

I'm trying to join two datasets based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a…
JRock
  • 1
  • 2
0
votes
0 answers

R Exact and Fuzzy joins using multiple columns

I am working on data that contains a list of treatment providers in one file and rural-urban codes in another file. The ultimate goal is to link the rural-urban codes to the treatment centers' locations via county and state locations. I have tried…
aarsmith
  • 65
  • 6
0
votes
1 answer

Using stringdist in R with big dataset (1.8 millions rows)?

I'm working with a dataset (df) which contains a column called job, where people just enter their job position. The problem is that the data is typed manually, so it contains a lot of misspellings. To do some calculations grouping by job, I'm…
Tung Anh
  • 3
  • 2
0
votes
1 answer

Compare one column in database x to another column in database y and return a database z containing highly likely matches

I want to take a list of customer names and compare them to an internal database to find a highly likely match and return a customer code. So I would receive a list of customers like this: Cx Name Chicken C. Water Gmbh Computer ldt Food,…
NomNonYon
  • 87
  • 6
0
votes
1 answer

how to create loop for multiple output vectors with grabl function in stringdist

I'm trying to apply the grabl function of stringdist to a large character vector "testref". I want to check whether the strings in another character vector "testtitle" can be found in "testref". However, grabl only allows for a single string…
Jonas
  • 1
  • 1
0
votes
0 answers

TERR - Spotfire Custom Column Expression not working using stringdist

This works fine in R Studio. library(stringdist) result <- afind(input1, input2, method="cosine") distance <- result[2] real_distance <- distance[[1]][1] output <- real_distance When I add it as a column expression in Spotfire, where input1 is a…
smackenzie
  • 2,880
  • 7
  • 46
  • 99
0
votes
1 answer

R - stringdist aFind() method, no maxDist parameter

The documentation for aFind specifies a maxDist parameter you can use, but there is no maxDist parameter you can pass into aFind? https://cran.r-project.org/web/packages/stringdist/stringdist.pdf using this code: result = afind(ae_target_term,…
smackenzie
  • 2,880
  • 7
  • 46
  • 99
0
votes
0 answers

Cosine similarity between rows of two large dataframes in R

I've two dataframes DF1 and DF2. One of them is a very large DF. I've created examples DF1 and 2 like this: library(tidyverse) A<-rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000000) DF1<-data.frame(A) B<-rep(c('Rockets', 'Pacers',…
0
votes
1 answer

Match two columns based on string distance in R

I have two very large dataframes containing names of people. The two dataframes report different information on these people (i.e. df1 reports data on health status and df2 on socio-economic status). A subset of people appears in both dataframes.…
srocco
  • 108
  • 7
0
votes
0 answers

Finding the pairwise distance of thousands of strings

My dataset looks like this but I have around 5*10^5 sequences/strings to compare pairwise and calculate their Levenshtein distance. I understand that this would lead to a matrix of 5*10^5 * 5*10^5 elements. I have tried so far the following packages…
LDT
  • 2,856
  • 2
  • 15
  • 32