Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
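As a rough illustration of the three families of distances and the approximate `match` described above, here is a minimal sketch. It assumes the stringdist package is installed; the example strings and variable names are made up for illustration and do not come from any question on this page.

```r
library(stringdist)

# Edit-based distance: Levenshtein counts insertions, deletions and substitutions.
d_lv <- stringdist("kitten", "sitting", method = "lv")  # 3

# q-gram based distance: compares counts of overlapping substrings of length q.
d_qgram <- stringdist("abc", "abd", method = "qgram", q = 2)

# Heuristic metric: Jaro-Winkler, where p is the prefix weight.
d_jw <- stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate counterpart of match(): index of the closest table entry,
# or NA when nothing lies within maxDist.
idx <- amatch("leia", c("uhura", "leela"), maxDist = 3)

# Soundex code of a name.
code <- phonetic("Robert", method = "soundex")  # "R163"
```

The `method` argument selects the distance; `maxDist` in `amatch` is the knob most questions under this tag end up tuning.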

162 questions

1 answer

String matching using 'stringdist' and 'amatch' in R

This is a question for anyone familiar with the 'stringdist' package. I am trying to write a function that does the following: Searches a very long list of characters such as this (only 16 of ~1 million shown): > stripList [1]…

asked Mar 13 '14 at 06:52

tomathon

3 answers

Use if/then for loop and amatch or match to find similar values and match two dataframe columns?

I have two dataframes, one with raw data labels and one with the correct adjusted values the data needs to be matched to. The labels are numeric but can differ up to +/- 2. I am trying to figure out how to write a coded if/then loop since amatch…

r for-loop if-statement matching stringdist

asked Jul 06 '23 at 17:25

Elizabeth Wallace

0 answers

Fixing fuzzyjoin error message: vector memory exhausted

I'm trying to join two data sets using fuzzy matching through the stringdist_left_join function from the fuzzyjoin library, but I keep getting the error message "Error: vector memory exhausted (limit reached?)." Does anybody know why this may be…

r stringdist fuzzyjoin

asked Apr 13 '23 at 17:43

yankees_fan

1 answer

Speeding up a nested loop in R for distance comparison

I have 2 dataframes - STORE_LIST_A (50,000 rows) & STORE_LIST_B (30,000 rows). Both these dataframes contain these 3 columns - STORE_ID, LATITUDE,…

r loops stringdist

asked Apr 07 '23 at 07:13

avishkar683

2 answers

how to replace a dataframe with another dataframe in R

I want to replace the data in df1 with df2, where df2 is data like df1. Example: df1 <- data.frame( name = c( "A. MAHJUM-61365", "A. MAHJUM-61365. MAHJUM-61365", "A. RIZAL. AD-11002795", "A. RIZAL. AD-11002795. RIZAL. AD-11002795", …

r dplyr stringr stringdist

asked Apr 04 '23 at 06:03

Fadhil Dzikri

0 answers

Data consolidation and cleaning using fuzzy string comparisons with -matchit- command

I have two databases, one designated data and another data1 (reference), where I want to compare the codes of each data designation and data2, I have to do it by writing the designations, if they are written the same or similar, I have to have the…

r text-mining strsplit stringdist fuzzyjoin

asked Jan 20 '23 at 15:23

Mariama Drame

1 answer

Using stringdist_join with differing column names

I have example data as follows: library(fuzzyjoin) a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6")) b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), w = c("1","2",…

r stringdist

asked Apr 16 '22 at 09:00

Tom

2 answers

Stringdist distance unexpectedly large

The following data has the surprising result that it does not match. I was expecting the distance to be 5, but even at 7 I get no match library(fuzzyjoin) one <- as.data.frame("Other field crops (non-organic)") names(one) <- "A" two <- …

r string levenshtein-distance stringdist

asked Apr 14 '22 at 13:19

Tom

0 answers

Edit distance for a four-digit sequential ranking in R? (stringdist)

Right now, I am trying to create scale scores for participants who ranked four job candidates (A, B, C, and D) to a role from best fit to worst fit. The correct order is A, D, C, B. As far as my dataframe goes, the correct sequence for columns A, B,…

r edit-distance stringdist

asked Apr 07 '22 at 01:12

xenotharm

1 answer

Ignoring the case for maxDist in stringdist::extract

I am using the stringdist package in R. For several options: grab(x, pattern, maxDist = Inf, value = FALSE, ...) grabl(x, pattern, maxDist = Inf, ...) extract(x, pattern, maxDist = Inf, ...) it uses maxDist. This option however counts the…

r stringdist

asked Nov 04 '21 at 12:13

Tom

1 answer

Finding matches for multiple words with stringdist

I have test data as follows. I am trying to find (near) matches for a vector of words, using stringdist as the actual database is large: library(stringdist) test_data <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…

r string fuzzy-search stringdist

asked Nov 03 '21 at 13:13

Tom

4 answers

Determine (dis)similarity of multi-word strings on a word-by-word basis

I'm working on string distance in multi-word strings, as in this toy data: df <- data.frame( col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz") ) I'd like to determine the (dis)similarity of each row compared to the next row on a…

r dplyr stringdist

asked Oct 22 '21 at 08:09

Chris Ruehlemann

1 answer

How do I lock the first digits of the 'by' column in a stringdist join?

I am trying to use stringdist_join to merge two tables. I have built my 'by' variable as the concatenation of three variables which are named as such: UAI : a serial number nom : surname prenom : name The code below works well, however I'd like to…

r string dplyr merge stringdist

asked Oct 19 '21 at 16:06

David Potrel

1 answer

Order mismatch and similarity

I have two values whose order is mismatched but whose contents are ideally the same. When I calculate the string similarity, the score between them is far from the ideal score col_1 = c("USA,UK,APAC") col_2 =…

r dplyr stringdist

asked Sep 29 '21 at 15:50

san1

1 answer

Multiply two named vectors/matrices, applying an n-gram model (stringdist::qgrams)

I am trying to apply an n-gram character model on a string to compute its probability in this model. I created a character bigram model with stringdist::qgrams(): library(tidyverse) library(stringdist) ref_corpus <- c("This is a sample sentence",…

r matrix-multiplication n-gram stringdist

asked Aug 24 '21 at 19:34

iNyar
