Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of Soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
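
As a rough illustration of the description above, the following minimal sketch uses the package's stringdistmatrix(), amatch() and phonetic() functions on small illustrative vectors. The input values and the maxDist threshold of 2 are assumptions chosen for the example, not recommendations.

```r
library(stringdist)

# Illustrative inputs (assumed for this sketch)
ref  <- c("cake", "brownies")
expr <- c("cak", "cakee", "cake", "rownies", "browwnies")

# Pairwise Levenshtein distances: one row per element of expr,
# one column per element of ref
d <- stringdistmatrix(expr, ref, method = "lv")

# Approximate analogue of match(): for each element of expr, the index
# of the closest element of ref within maxDist edits (NA if none)
idx <- amatch(expr, ref, method = "lv", maxDist = 2)
matched <- ref[idx]

# Soundex codes for two similar-sounding names
codes <- phonetic(c("Robert", "Rupert"), method = "soundex")
```

Other metrics can be selected through the method argument, for example "osa", "dl", "hamming", "qgram", "cosine", "jaccard" or "jw".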

162 questions
0
votes
1 answer

Get nearest n matching strings

Hi, I am trying to match a string against strings in a different data frame and get the nearest n matches based on score. Example: for the string_2 (df_2) column, I need to match against string_1 (df_1) and get the nearest 3 matches within each ID group. ID =…
san1
  • 455
  • 2
  • 11
0
votes
1 answer

Find the distance between groups of string in R

I have a very large dataset, which looks like this. I have two types of data frames: my reference data.frame ref=c("cake","brownies") and my experimental data.frame expr=c("cak","cakee","cake", "rownies","browwnies"). I want to match the ref and…
LDT
  • 2,856
  • 2
  • 15
  • 32
0
votes
2 answers

R fuzzy join with big dataframes

I would like to do a left_join(df1, df2) based on fuzzy matches. My df1 has 100k rows and my df2 has 25k rows. Basically, I would like to calculate the string similarity with the Jaro-Winkler method between the join_colum of the two data frames. So…
0
votes
1 answer

Phrase match irrespective of their position separated by comma

I have 2 data frames and need to compare df_1 to df_2, get the similar strings from col_2 of df_2, and store the number of phrases matched in the df_out data frame. col_1 = c("inside the world,worldwide web,google chrome app","world health…
san1
  • 455
  • 2
  • 11
0
votes
2 answers

Nearest string match and their rowId

I am trying to compare col_1 in the df_1 data frame with col_2 in the df_2 data frame to get the nearest top 3 matches with the lowest score (the lowest score represents the nearest match) and their respective rowid. Also, is there any flexibility to change top N nearest…
san1
  • 455
  • 2
  • 11
0
votes
1 answer

Match strings by distance between non-equal length ones

Say we have the following datasets: Dataset A: name age Sally 22 Peter 35 Joe 57 Samantha 33 Kyle 30 Kieran 41 Molly 28 Dataset B: name company Samanta A Peter B Joey …
teogj
  • 289
  • 1
  • 11
0
votes
1 answer

String matching using stringdist

I have two data frames with department names similar to these ones: d1 <- data.frame(depto=c("antioquia", "arauca", "arauca", "cauca", "popayan cauca", "guayabal cundinamarca", "cundinamarca", "cundinamarca", "fresno - tolima", "tolima",…
user2246905
  • 1,029
  • 1
  • 12
  • 31
0
votes
0 answers

How can I speed up this R code, in which I use stringdist?

I'm trying to clean up our customer database by identifying customer data that is similar enough to consider them the same customer (and thus give them the same customer id). I've concatenated the relevant customer data into one column named customerdata.…
0
votes
2 answers

How to return a list of pairs of strings from a large matrix that mutually satisfy a maximum stringdistance criterion?

I am trying to make a way of presenting human-input words in a way that makes their groupings more easily recognisable as referring to the same thing. Essentially a spellchecker. I have gotten as far as making a large matrix (the actual one is 250 *…
0
votes
1 answer

Fuzzyjoin / stringdist_join weight for capitalisation (case) mismatch (stringdist)

Working with R, I'm looking for ways to weight case (i.e., upper vs lower case) in a stringdist_left_join(). Here's a reproducible example: library(tidyverse) library(fuzzyjoin) tibble1 <- tibble(words = c("Bedford", "Maidenhead", "New Forest",…
gladys_c_hugh
  • 158
  • 1
  • 9
0
votes
1 answer

JaroWinkler Method --> Identifying Character/Numeric spots in a string

I am working on a problem to identify if a specified string has the correct format. I am attempting to use a fuzzy matching technique, JaroWinkler, to find the similarity score between a reference string and the strings of interest. The correct…
user2813606
  • 797
  • 2
  • 13
  • 37
0
votes
2 answers

Creating new field that shows stringdist between two columns in R?

I have two columns with ~20k rows of names (not all unique) that I want to compare row-by-row between the two columns. I also would like to compare length and get a % difference in length to LV distance so I can start grouping names based on how…
Dinho
  • 704
  • 4
  • 15
0
votes
1 answer

Iterate through two dataframes in R and compare corresponding column values

I have two data frames with text data about users: x <- data.frame("Address_line1" = c("123 Street","21 Hill drive"), "City" = c("Chicago","London"), "Phone" = c("123","219")) y <- data.frame("Address_line1" = c("461 road","PO Box…
Rahul
  • 23
  • 2
0
votes
1 answer

How to calculate longest common substring anywhere in two strings

I am trying to calculate the longest exact common substring without gaps between a string and a vector of strings in R. How do I modify stringdist to return any common string anywhere in the two compared strings and return the distance? Reproduce…
Neal Barsch
  • 2,810
  • 2
  • 13
  • 39
0
votes
1 answer

stringdist_semi_join only shows columns from dataframe1

I have two dataframes: df1 <- data.frame(City=c("Munchen_Paris","Munchen_Paris","Barcelona_Milan", "Londen_Dublin","Madrid_Malaga"), value1=c(11,21,33,2,53)) df2 <-…
user2165379
  • 445
  • 4
  • 20