I am working on a problem to identify whether a given string has the correct format. I am attempting to use a fuzzy matching technique, Jaro-Winkler, to compute a similarity score between a reference string and the strings of interest.

The correct format for the string follows this order (N=number, C=character): NNNCCCCCC

I found a similar problem in another Stack Overflow question and adapted the code slightly:

library(RecordLinkage)
library(dplyr)
library(stringdist)

ref <-c('123ABCDEF')
words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)

df <- wordlist %>% 
        group_by(words) %>% 
        mutate(match_score = jarowinkler(words, ref))

df <- as.data.frame(df)
df

I know the Jaro-Winkler method scores strings by their common characters and the distance between them, but I'm not sure it is the best method here. Ideally, I'd like the first and last elements in the words vector to be classified as correct and receive scores of 1, since they have the NNNCCCCCC format.

However, when I run this code, I get the following:

      words       ref match_score
1 456GHIJKL 123ABCDEF   0.0000000
2 123ABCDEF 123ABCDEF   1.0000000
3 78D78DAA2 123ABCDEF   0.3148148
4 660ABCDEF 123ABCDEF   0.7777778

Is there a better method for this type of matching exercise? Any help would be appreciated! Thank you!

user2813606
  • If you are looking for a specific pattern, I wouldn't use approximate string matching. You have a clear pattern: 3 digits followed by 6 characters, so you should do exact string matching instead. – deschen Nov 30 '20 at 21:20

1 Answer

As suggested in the comment above, I would use exact string matching. The only open question is what you mean by "characters": only the letters A-Z, or also, e.g., punctuation? If only letters, see the code below.

library(tidyverse)

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

# ^ and $ anchor the pattern, so the whole string must be 3 digits followed by 6 letters
str_detect(words, "^[[:digit:]]{3}[[:alpha:]]{6}$")

which gives:

[1]  TRUE  TRUE FALSE  TRUE
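
For readers who prefer not to load the tidyverse, here is a minimal base R sketch of the same check. It assumes "characters" means ASCII letters only, and it anchors the pattern with ^ and $ so that longer strings that merely contain a valid substring are rejected. The name is_valid is only illustrative.

```r
words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

# Assumption: "characters" means ASCII letters A-Z / a-z.
# ^ and $ force the pattern to cover the entire string.
is_valid <- grepl("^[0-9]{3}[A-Za-z]{6}$", words)

is_valid
```

Without the anchors, a string such as "9123ABCDEFGH" would also be reported as a match, because it contains three digits followed by six letters somewhere inside it.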

Updating the answer to reflect the asker's changed pattern (position 5 may be a digit or a letter):

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF", "660A7CDEF")

# 3 digits, 1 letter, 1 digit-or-letter, then 4 letters; anchored to the whole string
str_detect(words, "^[[:digit:]]{3}[[:alpha:]][[:alnum:]][[:alpha:]]{4}$")

gives:

[1]  TRUE  TRUE FALSE  TRUE  TRUE
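
If it helps to see why a string fails, one option is to translate each string into the N/C notation used in the question and compare that "shape" with the expected one. This is only a sketch: to_shape is a hypothetical helper, and it assumes ASCII letters and digits, leaving any other character unchanged.

```r
words <- c("456GHIJKL", "123ABCDEF", "78D78DAA2", "660ABCDEF")

# Hypothetical helper: map letters to "C" and digits to "N".
# Letters are replaced first because "N" is itself a letter;
# doing digits first would turn every "N" into "C" afterwards.
to_shape <- function(x) {
  x <- gsub("[A-Za-z]", "C", x)
  gsub("[0-9]", "N", x)
}

shapes <- to_shape(words)

# One row per word, with its shape and whether it matches the target format
data.frame(words = words, shape = shapes, ok = shapes == "NNNCCCCCC")
```

Here "78D78DAA2" becomes "NNCNNCCCN", which makes it easy to see at which positions it departs from NNNCCCCCC.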
deschen
  • Thanks so much! How would I allow for some flexibility if, let's say, the 5th spot could also be a character? – user2813606 Dec 01 '20 at 14:13
  • Could be a character or has to be a character? – deschen Dec 01 '20 at 20:16
  • I’m thinking along the lines of if I wanted to say positions 1-3 have to be numeric, position 4 has to be character, position 5 can be either, and then positions 6-9 have to be numeric. – user2813606 Dec 02 '20 at 21:03
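
The rule described in the last comment (positions 1-3 numeric, position 4 a letter, position 5 either, positions 6-9 numeric) can be sketched as one anchored pattern in base R. The test strings in candidates are made up for illustration and are not from the thread; "character" is again assumed to mean an ASCII letter.

```r
# Hypothetical test strings, not taken from the question
candidates <- c("123A45678", "123AB5678", "123ABC678", "12AB56789")

# positions 1-3: digits | 4: letter | 5: letter or digit | 6-9: digits
pattern <- "^[0-9]{3}[A-Za-z][A-Za-z0-9][0-9]{4}$"

matches <- grepl(pattern, candidates)
matches
```

The first two strings satisfy the rule. The third has a letter in position 6, and the fourth has a letter in position 3, so both are rejected.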