Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
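
As a rough illustration of the features described above, the following minimal sketch calls the package's `stringdist()`, `amatch()` and `phonetic()` functions. It assumes the stringdist package is installed from CRAN; the example strings and variable names are made up for illustration and do not come from any question on this page.

```r
library(stringdist)

# Edit-based distances
lev <- stringdist("kitten", "sitting", method = "lv")   # Levenshtein: 3
osa <- stringdist("abc", "acb", method = "osa")         # one adjacent transposition: 1
lv2 <- stringdist("abc", "acb", method = "lv")          # two substitutions: 2

# q-gram based distance (here: Jaccard distance on 2-grams)
jac <- stringdist("night", "nacht", method = "jaccard", q = 2)

# Heuristic distance (Jaro-Winkler, prefix scaling p = 0.1)
jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate version of match(): index of the closest table entry
# within maxDist, or NA if none is close enough
idx <- amatch("aple", c("banana", "apple", "cherry"), maxDist = 1)   # 2

# Soundex code
sx <- phonetic("Robert")   # "R163"

print(c(lev = lev, osa = osa, lv2 = lv2, jac = jac, jw = jw))
print(idx)
print(sx)
```

`amatch()` uses the same `method` argument as `stringdist()`, so the metric used for matching can be swapped without changing the rest of the call.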

162 questions

0 answers

R converting nested for loops_the nested parallel foreach doesn't work

I have a table with ~1M entry points (where each line is an insurance contract, i.e. one client can have multiple contracts) and cols client_id, names and adresses. The problem I am trying to solve is that the same client can have different…

asked Aug 16 '21 at 18:18

Yacine Hafiane

1 answer

Merging two data frames based on maximum number of words in common in R

I have two data.frame one containing partial name and the other one containing full name as follow partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU") full <- data.frame("full.name" = c("Apple Inc",…

r stringr sqldf stringdist fuzzyjoin

asked May 08 '21 at 14:18

JMCrocs

0 answers

Finding the best similarity measure for a group of documents

As someone new to NLP, I am trying to find a solution to a problem that doesn't seem to be well documented - estimating the degree of similarity for a group of documents as opposed to a pair of documents. Say that I have two groups of words a and b ,…

r nlp similarity cosine-similarity stringdist

asked Aug 17 '20 at 05:09

iskandarblue

1 answer

Compare corresponding columns of a data frame with list in R

I have a data frame containing user data x <- data.frame("Address_line1" = c("461 road","PO Box 123","543 Highway"), "City" = c("Dallas","Paris","New York" ), "Phone" = c("235","542","842")) x Address_line1 City Phone 1 …

r dplyr tidyr stringdist

asked Jun 26 '20 at 16:36

Rahul

2 answers

fuzzy grouping in R

library(tidyverse) data <- tibble(city =c('Montreal','Montréal','Ottawa','Ottawa','New York','Newyork','New-York'), value = 1:7) data%>% group_by(city)%>% summarise(mean = mean(value)) and I'd like to obtain something like that but unfortunately…

r string fuzzy stringdist

asked May 12 '20 at 19:59

olivroy

1 answer

R Function to identify non-matching rows

I am trying to compare 2 data.frames, "V1" represents my CRM, "V2" represents Leads that I would like to send out. 'V1 has roughly 8k elements' 'V2 has roughly 25k elements' I need to compare every row in V2 to every row in V1, discard every…

r tidyverse stringdist

asked Apr 13 '20 at 21:07

sbaumbaugh

1 answer

Replacing for loop with apply functions

I want to replace nested for loops with appropriate apply function in R. I declare a matrix with the following dimensions - ncol is 412 and nrow 2164 dist.name.enh <- matrix(NA, ncol = length(WW_name),nrow = length(Px_name)) The for loops for…

r for-loop lapply sapply stringdist

asked Jan 21 '20 at 06:34

darkage

1 answer

R Finding elements matching with each other within a vector

I have a list of addresses. These addresses were input by various users and hence there are lot of differences in the way a same address is written. For example, "andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh…

r pattern-matching stringdist agrep

asked Dec 31 '19 at 08:01

Apricot

1 answer

Accept "close matches" when using strings in a Python function?

I'm trying to use a shortest path function to find the distance between strings in a graph. The problem is that sometimes there are close matches that I want to count. For example, I would like "communication" to count as "communications" or…

python string nlp fuzzy-comparison stringdist

asked Nov 01 '19 at 13:06

Kenneth Crowther

1 answer

Calculate Levenshtein/Hamming distance by grouping variable

I am trying to calculate the accuracy of participants' response (column MEM_Response) based on the correct response (columns MEM_Correct). The grouping variable would be the participant's ID (in this case column SERIAL--> 15 cases per…

r levenshtein-distance hamming-distance stringdist

asked Jun 25 '19 at 09:53

annedroid

1 answer

How to explicitly build sparse stringdistmatrix to avoid running out of memory?

Match large number of slightly varying restaurant names in "data" vector to appropriate "match" vector: The stringdistmatrix function in stringdist package is great, but runs out of memory for a few 10k x 10k and my data is larger. Tried…

r sparse-matrix stringdist

asked Jun 23 '19 at 20:43

David Lucey

2 answers

Remove for loop from stringdist algorithm in R

I've made an algorithm to determine scores of matching strings from 2 dataframes in R. It will search for each row in test_ech the matching rows which their score is above 0.75 in test_data (based on the matching of 3 columns from each data frame).…

r for-loop stringdist

asked Jun 04 '19 at 12:31

Amine96

1 answer

Order Independent String Matching in R

I am trying to match names in Table A to the names present in master table. The order of names present in Table A is not exactly in a consistent format which means not necessarily name will start with first name, it's all random in some cases it…

r string matching fuzzy stringdist

asked Feb 25 '19 at 16:21

chitvan gupta

1 answer

String matching using stringdist in r?

I want to match and then later replace the string to the closest match. I am using the stringdist library. Below is my code stringdistmatrix("2 ltr thums up", c("solar thyme 30g", "Thums Up 2 L"), method = "lv") It gives the output as below: [,1]…

r levenshtein-distance stringdist

asked Feb 15 '19 at 11:14

nk23

0 answers

String distances and variable substitution costs

I want to quantify distance between word pairs based on phonological features. Insertion and deletion costs will stay constant but substitution costs will vary according to letter pair which are stored in a matrix. I am thinking of using the…

stringdist variable-substitution

asked Mar 26 '18 at 15:59

Rkindellan
