Questions tagged [stringdist]

stringdist is an R package that implements an approximate string matching version of R's native 'match' function. It can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors, taking proper care of encoding, or between integer vectors representing generic sequences.
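
As a rough illustration of the description above, here is a minimal sketch of the three metric families, the approximate `match` and soundex. It assumes the stringdist package is installed; the example strings and variable names are made up for illustration and do not come from any question listed here.

```r
library(stringdist)

# Edit-based distances
d_lv  <- stringdist("kitten", "sitting", method = "lv")   # Levenshtein
d_osa <- stringdist("abc", "acb", method = "osa")         # optimal string alignment: a swap of adjacent characters is one edit
d_ham <- stringdist("abc", "acb", method = "hamming")     # Hamming: number of differing positions (equal-length strings)

# q-gram based distance (here on bigrams, q = 2)
d_jac <- stringdist("season", "seson", method = "jaccard", q = 2)

# Heuristic metric: Jaro-Winkler (p is the prefix weight; p = 0 gives plain Jaro)
d_jw <- stringdist("MARTHA", "MARHTA", method = "jw", p = 0.1)

# Approximate version of match(): index of the closest table entry within
# maxDist, or NA when nothing is close enough
idx <- amatch("leia", c("uhura", "leela"), maxDist = 3)

# Soundex code of a string, and a soundex-based distance (0 when the codes agree)
code  <- phonetic("Robert")
d_sdx <- stringdist("Robert", "Rupert", method = "soundex")
```

`amatch()` uses the optimal string alignment distance unless another `method` is given, and `maxDist` caps how far apart two strings may be and still count as a match.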

162 questions
2 votes, 1 answer

String matching using 'stringdist' and 'amatch' in R

This is a question for anyone familiar with the 'stringdist' package. I am trying to write a function that does the following: Searches a very long list of characters such as this (only 16 of ~1 million shown): > stripList [1]…
tomathon
1 vote, 3 answers

Use if/then for loop and amatch or match to find similar values and match two dataframe columns?

I have two dataframes, one with raw data labels and one with the correct adjusted values the data needs to be matched to. The labels are numeric but can differ up to +/- 2. I am trying to figure out how to write a coded if/then loop since amatch…
1 vote, 0 answers

Fixing fuzzyjoin error message: vector memory exhausted

I'm trying to join two data sets using fuzzy matching through the stringdist_left_join function from the fuzzyjoin package, but I keep getting the error message "Error: vector memory exhausted (limit reached?)." Does anybody know why this may be…
1 vote, 1 answer

Speeding up a nested loop in R for distance comparison

I have 2 dataframes - STORE_LIST_A (50,000 rows) & STORE_LIST_B (30,000 rows). Both these dataframes contain these 3 columns - STORE_ID, LATITUDE,…
1 vote, 2 answers

How to replace a dataframe with another dataframe in R

I want to replace the data in df1 with df2, where df2 holds data like df1. Example: df1 <- data.frame( name = c( "A. MAHJUM-61365", "A. MAHJUM-61365. MAHJUM-61365", "A. RIZAL. AD-11002795", "A. RIZAL. AD-11002795. RIZAL. AD-11002795", …
1 vote, 0 answers

Data consolidation and cleaning using fuzzy string comparisons with -matchit- command

I have two databases, one designated data and another data1 (reference), where I want to compare the codes of each data designation and data2, I have to do it by writing the designations, if they are written the same or similar, I have to have the…
1 vote, 1 answer

Using stringdist_join with differing column names

I have example data as follows: library(fuzzyjoin) a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6")) b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), w = c("1","2",…
Tom
1 vote, 2 answers

Stringdist distance unexpectedly large

The following data has the surprising result that it does not match. I was expecting the distance to be 5, but even at 7 I get no match: library(fuzzyjoin) one <- as.data.frame("Other field crops (non-organic)") names(one) <- "A" two <- …
Tom
1 vote, 0 answers

Edit distance for a four-digit sequential ranking in R? (stringdist)

Right now, I am trying to create scale scores for participants who ranked four job candidates (A, B, C, and D) to a role from best fit to worst fit. The correct order is A, D, C, B. As far as my dataframe goes, the correct sequence for columns A, B,…
xenotharm
1 vote, 1 answer

Ignoring the case for maxDist in stringdist::extract

I am using the stringdist package in R. For several options: grab(x, pattern, maxDist = Inf, value = FALSE, ...) grabl(x, pattern, maxDist = Inf, ...) extract(x, pattern, maxDist = Inf, ...) it uses maxDist. This option however counts the…
Tom
1 vote, 1 answer

Finding matches for multiple words with stringdist

I have test data as follows. I am trying to find (near) matches for a vector of words, using stringdist as the actual database is large: library(stringdist) test_data <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,…
Tom
1 vote, 4 answers

Determine (dis)similarity of multi-word strings on a word-by-word basis

I'm working on string distance in multi-word strings, as in this toy data: df <- data.frame( col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz") ) I'd like to determine the (dis)similarity of each row compared to the next row on a…
Chris Ruehlemann
1 vote, 1 answer

How do I lock the first digits of the 'by' column in a stringdist join?

I am trying to use stringdist_join to merge two tables. I have built my 'by' variable as the concatenation of three variables, named as follows: UAI (a serial number), nom (surname), prenom (name). The code below works well, however I'd like to…
David Potrel
1 vote, 1 answer

Order mismatch and similarity

I have two values whose order is mismatched but whose contents are ideally the same. When I calculate the string similarity, the score between them is far from the ideal score: col_1 = c("USA,UK,APAC") col_2 =…
san1
1 vote, 1 answer

Multiply two named vectors/matrices, applying an n-gram model (stringdist::qgrams)

I am trying to apply an n-gram character model on a string to compute its probability in this model. I created a character bigram model with stringdist::qgrams(): library(tidyverse) library(stringdist) ref_corpus <- c("This is a sample sentence",…
iNyar