I'm attempting to do some string distance matching in R and am struggling to get a usable output.
I have a data frame terms that contains 5 strings of text, along with a category for each string. I have a second data frame notes that contains 10 poorly spelt words, along with a NoteID.
I want to compare each of my 5 terms against each of my 10 notes using a distance algorithm to try to catch simple spelling errors. I have tried:
near_match <- subset(notes, jarowinkler(notes$word, terms$word) > 0.9)
   NoteID    Note
5      e5 thought
10     e5   tough
and
jarowinkler(notes$word, terms$word)
[1] 0.8000000 0.7777778 0.8266667 0.8833333 0.9714286 0.8000000 0.8000000 0.8266667 0.8833333 0.9500000
The first attempt is almost what I need; it just lacks the word from terms that caused the match. The second returns 10 scores, but I'm not sure whether the algorithm checked each of the 5 terms against each of the 10 notes in turn and returned only the closest match (highest score), or did something else.
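A minimal sketch for checking this, assuming jarowinkler() comes from the RecordLinkage package (the package is not named above). It compares the original call with one where the 5 terms are repeated by hand to line up with the 10 notes; the vector names notes_word and terms_word are just for illustration.

```r
library(RecordLinkage)  # assumption: jarowinkler() comes from this package

notes_word <- c("hit", "hot", "shirt", "than", "thought",
                "hat", "get", "shirt", "than", "tough")
terms_word <- c("hot", "got", "shot", "that", "though")

# Scores as computed in the second attempt above
scores <- jarowinkler(notes_word, terms_word)

# Same call with the 5 terms repeated by hand to match the 10 notes
paired <- jarowinkler(notes_word,
                      rep(terms_word, length.out = length(notes_word)))

# TRUE would mean note i was only compared with term i (terms recycled),
# not with every term
isTRUE(all.equal(scores, paired))
```

The scores printed above look consistent with element-wise pairing: the only two values over 0.9 are in positions 5 (thought vs though) and 10 (tough vs though), which are exactly the two rows subset() returned.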
How can I alter the above to get my desired output, if that is achievable using jarowinkler(), or is there a better option?
I'm relatively new to R, so I'd appreciate any help in understanding how the algorithm generates the scores and what approach would achieve my desired output.
Example data frames are below.
Thanks
> notes
   NoteID    word
1      a1     hit
2      b2     hot
3      c3   shirt
4      d4    than
5      e5 thought
6      a1     hat
7      b2     get
8      c3   shirt
9      d4    than
10     e5   tough
> terms
  Category   word
1        a    hot
2        b    got
3        a   shot
4        d   that
5        c though
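For reproducibility, the two data frames can be rebuilt as below, followed by a rough sketch of the kind of all-pairs output I'm after. This again assumes jarowinkler() is the RecordLinkage one and that the word columns are character rather than factor; using outer() and the Term, Category and Score column names is only one possible way to lay it out, not necessarily the best.

```r
library(RecordLinkage)  # assumption: jarowinkler() comes from this package

notes <- data.frame(
  NoteID = c("a1", "b2", "c3", "d4", "e5", "a1", "b2", "c3", "d4", "e5"),
  word   = c("hit", "hot", "shirt", "than", "thought",
             "hat", "get", "shirt", "than", "tough"),
  stringsAsFactors = FALSE
)
terms <- data.frame(
  Category = c("a", "b", "a", "d", "c"),
  word     = c("hot", "got", "shot", "that", "though"),
  stringsAsFactors = FALSE
)

# 10 x 5 matrix of scores: one row per note, one column per term
scores <- outer(notes$word, terms$word, jarowinkler)

# Row/column indices of every (note, term) pair above the threshold
hits <- which(scores > 0.9, arr.ind = TRUE)

# Long-format result that also records which term caused the match
near_match <- data.frame(
  NoteID   = notes$NoteID[hits[, "row"]],
  Note     = notes$word[hits[, "row"]],
  Term     = terms$word[hits[, "col"]],
  Category = terms$Category[hits[, "col"]],
  Score    = scores[hits],
  stringsAsFactors = FALSE
)
near_match
```

Because every note is scored against every term here, this may return more rows than the subset() attempt did, for example exact matches such as hot vs hot.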