Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences.
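As a rough illustration of the distance families listed above, here is a minimal sketch in R. It assumes the stringdist package is installed; the example strings and variable names are made up for illustration and do not come from any question on this page.

```r
library(stringdist)

# Edit-based distances
d_lv  <- stringdist("kitten", "sitting", method = "lv")   # Levenshtein: 3 edits
d_osa <- stringdist("abc", "acb", method = "osa")         # one adjacent transposition: 1

# q-gram based distance (Jaccard on 2-grams), a value between 0 and 1
d_jac <- stringdist("night", "nacht", method = "jaccard", q = 2)

# Heuristic metric: Jaro-Winkler with prefix factor p, a value between 0 and 1
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate version of match(): index of the closest table entry
# within maxDist, or NA if none is close enough
idx <- amatch("leia", c("uhura", "leela"), maxDist = 2)   # 2

# Soundex code
code <- phonetic("Robert", method = "soundex")            # "R163"

# All pairwise distances between two character vectors
m <- stringdistmatrix(c("foo", "bar"), c("fu", "baz", "boo"), method = "lv")
```

Which method suits a given matching task depends on the data: edit distances are sensitive to word order, while q-gram distances are more tolerant of reordered tokens.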

162 questions
1
vote
0 answers

R converting nested for loops: the nested parallel foreach doesn't work

I have a table with ~1M entry points (where each line is an insurance contract, i.e. one client can have multiple contracts) and cols client_id, names and adresses. The problem I am trying to solve is that the same client can have different…
1
vote
1 answer

Merging two data frames based on maximum number of words in common in R

I have two data.frame one containing partial name and the other one containing full name as follow partial <- data.frame( "partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU") full <- data.frame("full.name" = c("Apple Inc",…
JMCrocs
  • 77
  • 7
1
vote
0 answers

Finding the best similarity measure for a group of documents

As someone new to NLP, I am trying to find a solution to a problem that doesn't seem to be well documented - estimating the degree of similarity for a group of documents as opposed to a pair of documents. Say that I have two groups of words a and b,…
iskandarblue
  • 7,208
  • 15
  • 60
  • 130
1
vote
1 answer

Compare corresponding columns of a data frame with list in R

I have a data frame containing user data x <- data.frame("Address_line1" = c("461 road","PO Box 123","543 Highway"), "City" = c("Dallas","Paris","New York" ), "Phone" = c("235","542","842")) x Address_line1 City Phone 1 …
Rahul
  • 23
  • 2
1
vote
2 answers

fuzzy grouping in R

library(tidyverse) data <- tibble(city =c('Montreal','Montréal','Ottawa','Ottawa','New York','Newyork','New-York'), value = 1:7) data%>% group_by(city)%>% summarise(mean = mean(value)) and I'd like to obtain something like that but unfortunately…
olivroy
  • 548
  • 3
  • 13
1
vote
1 answer

R Function to identify non-matching rows

I am trying to compare 2 data.frames, "V1" represents my CRM, "V2" represents Leads that I would like to send out. 'V1 has roughly 8k elements' 'V2 has roughly 25k elements' I need to compare every row in V2 to every row in V1, discard every…
sbaumbaugh
  • 13
  • 5
1
vote
1 answer

Replacing for loop with apply functions

I want to replace nested for loops with appropriate apply function in R. I declare a matrix with the following dimensions - ncol is 412 and nrow 2164 dist.name.enh <- matrix(NA, ncol = length(WW_name),nrow = length(Px_name)) The for loops for…
darkage
  • 857
  • 3
  • 12
  • 22
1
vote
1 answer

R Finding elements matching with each other within a vector

I have a list of addresses. These addresses were input by various users and hence there are a lot of differences in the way the same address is written. For example, "andheri at weh pump house", "andheri pump house","andheri pump house(mt)","weh…
Apricot
  • 2,925
  • 5
  • 42
  • 88
1
vote
1 answer

Accept "close matches" when using strings in a Python function?

I'm trying to use a shortest path function to find the distance between strings in a graph. The problem is that sometimes there are close matches that I want to count. For example, I would like "communication" to count as "communications" or…
1
vote
1 answer

Calculate Levenshtein/Hamming distance by grouping variable

I am trying to calculate the accuracy of participants' response (column MEM_Response) based on the correct response (columns MEM_Correct). The grouping variable would be the participant's ID (in this case column SERIAL--> 15 cases per…
1
vote
1 answer

How to explicitly build sparse stringdistmatrix to avoid running out of memory?

Match large number of slightly varying restaurant names in "data" vector to appropriate "match" vector: The stringdistmatrix function in stringdist package is great, but runs out of memory for a few 10k x 10k and my data is larger. Tried…
David Lucey
  • 252
  • 3
  • 9
1
vote
2 answers

Remove for loop from stringdist algorithm in R

I've made an algorithm to determine scores of matching strings from 2 dataframes in R. It will search, for each row in test_ech, the matching rows whose score is above 0.75 in test_data (based on the matching of 3 columns from each data frame).…
Amine96
  • 65
  • 6
1
vote
1 answer

Order Independent String Matching in R

I am trying to match names in Table A to the names present in the master table. The order of names present in Table A is not in a consistent format, which means a name will not necessarily start with the first name; it's all random in some cases it…
1
vote
1 answer

String matching using stringdist in r?

I want to match and then later replace the string to the closest match. I am using the stringdist library. Below is my code stringdistmatrix("2 ltr thums up", c("solar thyme 30g", "Thums Up 2 L"), method = "lv") It gives the output as below: [,1]…
nk23
  • 179
  • 1
  • 10
1
vote
0 answers

String distances and variable substitution costs

I want to quantify distance between word pairs based on phonological features. Insertion and deletion costs will stay constant but substitution costs will vary according to letter pair which are stored in a matrix. I am thinking of using the…
Rkindellan
  • 11
  • 1