I'm attempting to do some distance matching in R and am struggling to achieve a usable output.

I have a dataframe terms that contains 5 strings of text, along with a category for each string. I have a second dataframe notes that contains 10 poorly spelt words, along with a NoteID.

I want to be able to compare each of my 5 terms against each of my 10 notes using a distance algorithm to try to grab simple spelling errors. I have tried:

near_match<- subset(notes, jarowinkler(notes$word, terms$word) >0.9)

   NoteID    Note
5      e5 thought
10     e5   tough

and

jarowinkler(notes$word, terms$word)

[1] 0.8000000 0.7777778 0.8266667 0.8833333 0.9714286 0.8000000 0.8000000 0.8266667 0.8833333 0.9500000

The first instance is almost what I need; it just lacks the word from terms that caused the match. The second returns 10 scores, but I'm not sure whether the algorithm checked each of the 5 terms against each of the 10 notes in turn and just returned the closest match (highest score).

How can I alter the above to achieve my desired output if what I want is achievable using jarowinkler() or is there a better option?

I'm relatively new to R, so I appreciate any help in furthering my understanding of how the algorithm generates the scores and what approach would achieve my desired output.

example dataframes below

Thanks

> notes
   NoteID    word
1      a1     hit
2      b2     hot
3      c3   shirt
4      d4    than
5      e5 thought
6      a1     hat
7      b2     get
8      c3   shirt
9      d4    than
10     e5   tough

> terms
  Category   word
1        a    hot
2        b    got
3        a   shot
4        d   that
5        c though
Cam23 19
  • The [stringdist](https://cran.r-project.org/web/packages/stringdist/stringdist.pdf) package can help with this stuff, if you find yourself doing this a lot. – twedl Jan 31 '18 at 15:35
  • Also the [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin) package is great. – twedl Jan 31 '18 at 16:00

1 Answer

Your data.frames:

notes<-data.frame(NoteID=c("a1","b2","c3","d4","e5","a1","b2","c3","d4","e5"),
                  word=c("hit","hot","shirt","than","thought","hat","get","shirt","that","tough"))
terms<-data.frame(Category=c("a","b","c","d","e"),
                  word=c("hot","got","shot","that","though"))
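
To see why the element-wise call in the question returns only 10 scores, it may help to list which pairs get compared. This is a minimal base R sketch; it assumes jarowinkler() recycles the shorter vector the way most vectorised R functions do, which is consistent with the ten scores shown in the question. The names `notes_word`, `terms_word`, `elementwise` and `all_pairs` are only illustrative.

```r
notes_word <- c("hit", "hot", "shirt", "than", "thought",
                "hat", "get", "shirt", "that", "tough")
terms_word <- c("hot", "got", "shot", "that", "though")

# Element-wise pairing: the shorter vector is recycled, so there are 10 pairs.
# paste() is used here only to show the pairing, not to compute a distance.
elementwise <- paste(notes_word, terms_word, sep = " vs ")
elementwise

# All combinations: every note against every term, 10 x 5 = 50 pairs.
all_pairs <- expand.grid(note = notes_word, term = terms_word,
                         stringsAsFactors = FALSE)
nrow(all_pairs)
```

Under that assumption each note is scored against a single term (the sixth note wraps around to the first term), not against all five, which is why a full distance matrix is needed.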

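Use stringdistmatrix() from the stringdist package with method = "jw". With the default p = 0 this is the plain Jaro distance; pass p = 0.1 to add the Winkler prefix weighting: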

library(stringdist)
dist<-stringdistmatrix(notes$word,terms$word,method = "jw")
row.names(dist)<-as.character(notes$word)
colnames(dist)<-as.character(terms$word)

Now you have all distances:

dist
              hot       got       shot       that     though
hit     0.2222222 0.4444444 0.27777778 0.27777778 0.50000000
hot     0.0000000 0.2222222 0.08333333 0.27777778 0.33333333
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
than    0.4722222 1.0000000 0.50000000 0.16666667 0.38888889
thought 0.3571429 0.5158730 0.40476190 0.40476190 0.04761905
hat     0.2222222 0.4444444 0.27777778 0.08333333 0.50000000
get     0.4444444 0.2222222 0.47222222 0.47222222 0.50000000
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
that    0.2777778 0.4722222 0.33333333 0.00000000 0.38888889
tough   0.4888889 0.4888889 0.51666667 0.51666667 0.05555556
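
Two of these entries can be checked by hand, which may help show how the score is built. The sketch below applies the standard Jaro formula (no Winkler prefix bonus, matching the p = 0 default) to counts worked out manually; `jaro_dist` is a hypothetical helper written for this illustration, not a stringdist function.

```r
# Jaro distance from hand-counted inputs:
#   m    = number of matching characters (within the allowed window)
#   t    = number of transpositions among the matched characters
#   len1 = length of the first string, len2 = length of the second
jaro_dist <- function(m, t, len1, len2) {
  if (m == 0) return(1)  # no matching characters: maximum distance
  1 - (m / len1 + m / len2 + (m - t) / m) / 3
}

# "hit" vs "hot": "h" and "t" match, nothing is transposed.
jaro_dist(m = 2, t = 0, len1 = 3, len2 = 3)  # 2/9, about 0.2222

# "hit" vs "got": only "t" matches.
jaro_dist(m = 1, t = 0, len1 = 3, len2 = 3)  # 4/9, about 0.4444
```

Each pair is scored on its own, so "got" ends up further from "hit" than "hot" does simply because it shares one character fewer.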

Find the closest term for each note:

output <- cbind(notes,
                word_close = terms[as.numeric(apply(dist, 1, which.min)), "word"],
                dist_min   = apply(dist, 1, min))
output
   NoteID    word word_close   dist_min
1      a1     hit        hot 0.22222222
2      b2     hot        hot 0.00000000
3      c3   shirt       shot 0.21666667
4      d4    than       that 0.16666667
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get        got 0.22222222
8      c3   shirt       shot 0.21666667
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556

If you want word_close to keep only the words under a certain distance threshold (0.1 in this case), you can blank out the rest:

output[output$dist_min>=0.1,c("word_close","dist_min")]<-NA
output
   NoteID    word word_close   dist_min
1      a1     hit       <NA>         NA
2      b2     hot        hot 0.00000000
3      c3   shirt       <NA>         NA
4      d4    than       <NA>         NA
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get       <NA>         NA
8      c3   shirt       <NA>         NA
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556
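
If a band of distances is of interest rather than a single cut-off, one possible approach is to keep every note/term pair whose distance falls inside the band instead of only the closest term. This is a sketch under assumptions: the bounds `lower` and `upper` and the names `d`, `hits` and `near` are illustrative choices, not part of the code above.

```r
library(stringdist)

notes <- data.frame(
  NoteID = c("a1", "b2", "c3", "d4", "e5", "a1", "b2", "c3", "d4", "e5"),
  word   = c("hit", "hot", "shirt", "than", "thought",
             "hat", "get", "shirt", "that", "tough"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  Category = c("a", "b", "c", "d", "e"),
  word     = c("hot", "got", "shot", "that", "though"),
  stringsAsFactors = FALSE
)

# All 10 x 5 distances; rows follow notes, columns follow terms.
d <- stringdistmatrix(notes$word, terms$word, method = "jw")

# Assumed band for "near but not exact" matches; tune as needed.
lower <- 0.1
upper <- 0.25

# Row/column positions of every pair inside the band.
hits <- which(d >= lower & d <= upper, arr.ind = TRUE)

# One row per matching pair, with the term that caused the match.
near <- data.frame(
  NoteID   = notes$NoteID[hits[, "row"]],
  note     = notes$word[hits[, "row"]],
  term     = terms$word[hits[, "col"]],
  Category = terms$Category[hits[, "col"]],
  dist     = d[hits]
)
near <- near[order(near$NoteID, near$dist), ]
near
```

A note can appear more than once here if several terms fall inside the band, which is the main difference from the one-row-per-note output above.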
Terru_theTerror
  • This looks great, but are you able to explain how the scoring works? Is it just on a one-to-one basis, or a comparison across the other terms? For the first row above, I would have thought hot and got would score the same, as it would take one change for each to get to hit. Also, can I define what range of scores I want to see a match for? For example, in the above, scores of 0.1 - 0.25 would indicate near words (I'm already finding exact matches). Thanks – Cam23 19 Feb 07 '18 at 15:38
  • In this example I used Jaro–Winkler distance (method="jw"). You can find the formula and a detailed explanation here: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance The comparison is between all words, as you can see in the table dist; after that I keep only the closest one. I have added a column to the output in the answer with the minimum distance [which is between 0 'perfect match' and 1 'maximum distance'] – Terru_theTerror Feb 07 '18 at 15:39
  • About the scores, I will add some code rows in the answer – Terru_theTerror Feb 07 '18 at 15:45
  • That's great, thanks for the quick response. The update works great and I can tune the 'flag score' as needed – Cam23 19 Feb 08 '18 at 09:29