Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences.
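As a rough illustration of the families of metrics listed above, here is a minimal sketch. The example strings, the `lookup` vector and the chosen `maxDist`, `q` and `p` values are made up for illustration and are not recommendations.

```r
# Minimal sketch; all example data below is hypothetical.
library(stringdist)

# Edit-based: Levenshtein counts insertions, deletions and substitutions.
lv <- stringdist("kitten", "sitting", method = "lv")        # 3

# Edit-based: Hamming counts differing positions in equal-length strings.
hm <- stringdist("karolin", "kathrin", method = "hamming")  # 3

# q-gram based: Jaccard distance on character bigrams (q = 2).
jac <- stringdist("brownies", "browwnies", method = "jaccard", q = 2)

# Heuristic: Jaro-Winkler, with p as the Winkler prefix weight.
jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate version of match(): position of the closest entry in
# `lookup` within maxDist, or NA when nothing is close enough.
lookup <- c("cake", "brownies")
idx <- amatch(c("cak", "rownies", "pizza"), lookup,
              method = "lv", maxDist = 1)

# Soundex codes; names that sound alike share a code.
sx <- phonetic(c("Robert", "Rupert"), method = "soundex")

# Distance between integer vectors treated as generic sequences.
sd <- seq_dist(c(1L, 2L, 3L), c(1L, 2L, 4L), method = "lv")
```

`lookup[idx]` then gives the matched reference string for each input, with `NA` where no entry was within `maxDist`.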

162 questions

1 answer

Get nearest n matching strings

Hi, I am trying to match a string against strings in a different data frame and get the nearest n matches based on score. Example: I need to match the string_2 (df_2) column against string_1 (df_1) and get the nearest 3 matches within each ID group. ID =…

r dplyr stringdist

asked Jan 06 '22 at 08:01

san1

1 answer

Find the distance between groups of string in R

I have a very large dataset, which looks like this. I have two types of data frames: my reference data.frame ref=c("cake","brownies") and my experimental data.frame expr=c("cak","cakee","cake", "rownies","browwnies"). I want to match the ref and…

r string stringdist

asked Dec 16 '21 at 19:46

LDT

2 answers

R fuzzy join with big dataframes

I would like to do a left_join(df1, df2) based on fuzzy matches. My df1 has 100k rows and my df2 has 25k rows. Basically, I would like to calculate the string similarity with the Jaro-Winkler method between the join_colum of the two data frames. So…

r stringdist fuzzyjoin

asked Nov 11 '21 at 14:40

crazy-wasserratte

1 answer

Phrase match irrespective of their position separated by comma

I have 2 data frames and need to compare df_1 to df_2, get similar strings from col_2 of df_2, and store the number of phrases matched in a df_out data frame. col_1 = c("inside the world,worldwide web,google chrome app","world health…

r dplyr stringdist

asked Oct 01 '21 at 07:38

san1

2 answers

Nearest string match and their rowId

I am trying to compare col_1 in the df_1 data frame with col_2 in the df_2 data frame to get the nearest top 3 matches with the lowest score (the lowest score represents the nearest match) and their respective rowid. Also, is there any flexibility to change top N nearest…

r dplyr stringdist

asked Sep 29 '21 at 07:44

san1

1 answer

Match strings by distance between non-equal length ones

Say we have the following datasets: Dataset A: name age Sally 22 Peter 35 Joe 57 Samantha 33 Kyle 30 Kieran 41 Molly 28 Dataset B: name company Samanta A Peter B Joey …

r dplyr stringr stringdist

asked Aug 20 '21 at 08:27

teogj

1 answer

String matching using stringdist

I have two data frames with department names similar to these ones: d1 <- data.frame(depto=c("antioquia", "arauca", "arauca", "cauca", "popayan cauca", "guayabal cundinamarca", "cundinamarca", "cundinamarca", "fresno - tolima", "tolima",…

r match stringdist

asked Jul 26 '21 at 14:21

user2246905

0 answers

How can I speed up this R code, in which I use stringdist?

I'm trying to clean up our customer database by identifying customer data that is similar enough to consider them the same customer (thus, give them the same customer id). I've concatenated relevant customerdata into one column named customerdata.…

r data-analysis data-cleaning levenshtein-distance stringdist

asked Mar 18 '21 at 13:03

Koen Direks

2 answers

How to return a list of pairs of strings from a large matrix that mutually satisfy a maximum stringdistance criterion?

I am trying to make a way of presenting human-input words in a way that makes their groupings more easily recognisable as referring to the same thing. Essentially a spellchecker. I have gotten as far as making a large matrix (the actual one is 250 *…

r stringdist

asked Jan 18 '21 at 16:29

Jose_harkhan

1 answer

Fuzzyjoin / stringdist_join weight for capitalisation (case) mismatch (stringdist)

Working with R, I'm looking for ways to weight case (i.e., upper vs lower case) in a string_dist_left_join(). Here's a reproducible example: library(tidyverse) library(fuzzyjoin) tibble1 <- tibble(words = c("Bedford", "Maidenhead", "New Forest",…

r stringdist fuzzyjoin

asked Dec 31 '20 at 21:27

gladys_c_hugh

1 answer

JaroWinkler Method --> Identifying Character/Numeric spots in a string

I am working on a problem to identify if a specified string has the correct format. I am attempting to use a fuzzy matching technique, JaroWinkler, to find the similarity score between a reference string and the strings of interest. The correct…

r comparison fuzzy-search stringdist jaro-winkler

asked Nov 30 '20 at 19:27

user2813606

2 answers

Creating new field that shows stringdist between two columns in R?

I have two columns with ~20k rows of names (not all unique) that I want to compare row-by-row between the two columns. I also would like to compare length and get a % difference in length to LV distance so I can start grouping names based on how…

r dplyr stringdist

asked Sep 22 '20 at 22:28

Dinho

1 answer

Iterate through two dataframes in R and compare corresponding column values

I have two data frames with text data about users: x <- data.frame("Address_line1" = c("123 Street","21 Hill drive"), "City" = c("Chicago","London"), "Phone" = c("123","219")) y <- data.frame("Address_line1" = c("461 road","PO Box…

r string dplyr stringr stringdist

asked Jun 24 '20 at 23:09

Rahul

1 answer

How to calculate longest common substring anywhere in two strings

I am trying to calculate the longest exact common substring without gaps between a string and a vector of strings in R. How do I modify stringdist to return any common string anywhere in the two compared strings and return the distance? Reproduce…

r string substring lcs stringdist

asked Jun 17 '20 at 01:22

Neal Barsch

1 answer

stringdist_semi_join only shows columns from dataframe1

I have two dataframes: df1 <- data.frame(City=c("Munchen_Paris","Munchen_Paris","Barcelona_Milan", "Londen_Dublin","Madrid_Malaga"), value1=c(11,21,33,2,53)) df2 <-…

r stringdist fuzzyjoin

asked Apr 10 '20 at 17:48

user2165379
