String match with R: Finding the best possible match

Question

I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to make the best possible match between the Lexicon and Corpus. I tried many methods. This is one of them.

library(stringr)

match<- paste(Lexicon,collapse= '|^') # I use the stemming method (snowball), so the words in Lexicon are word roots

test<- str_extract_all(Corpus, match, simplify = TRUE)

test

[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"

But, the match should be:

[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"

Instead, each word is matched to the first Lexicon entry that fits, which is the alphabetically first one. By the way, these vectors are a sample of a much bigger list that I have.
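A minimal sketch of why this happens, using base R rather than stringr so it runs without extra packages. It assumes a regex engine that tries alternatives left to right (PCRE here, selected with perl = TRUE; stringr's ICU engine orders alternatives the same way). The variable names are illustrative only.

```r
word <- "animalada"

# With ordered alternation, the first alternative that matches wins,
# even when a later alternative would match more of the word.
short_first <- regmatches(word, regexpr("^animal|^animalada", word, perl = TRUE))
long_first  <- regmatches(word, regexpr("^animalada|^animal", word, perl = TRUE))

print(short_first)  # "animal"
print(long_first)   # "animalada"
```

So the order of the entries in the pattern, not their length, decides which one is returned.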

I didn't try regex() because I'm not sure how it works. Perhaps the solution lies that way.

Could you help me to solve this problem? Thank you for your help.

score 1 · Answer 1 · answered Sep 23 '17 at 01:59

You can just use the match() function.

Index <- match(Corpus, Lexicon)

Index
[1] 2 3 4 6

Lexicon[Index]
[1] "animalada"  "fe"   "fernandez"  "ladrillo"

Santosh M.

score 0 · Answer 2 · answered Sep 23 '17 at 01:54

You can order Lexicon by the number of characters the patterns have, in decreasing order, so the best match comes first:

match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^')

test<- str_extract_all(Corpus, match, simplify= T)

test
#     [,1]       
#[1,] "animalada"
#[2,] "fe"       
#[3,] "fernandez"
#[4,] "ladrillo"
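One thing to watch: pasting with collapse = '|^' leaves the first alternative without a leading ^, so that entry can match anywhere in a word. As an alternative sketch that avoids regular expressions altogether, the longest Lexicon entry that is a prefix of each word can be picked with base R's startsWith(). The helper name longest_prefix and the extra entries 'tambien' and 'bien' are assumptions added for illustration, not part of the answer above.

```r
Corpus  <- c("animalada", "fe", "fernandez", "ladrillo", "tambien")
Lexicon <- c("animal", "animalada", "fe", "fernandez", "ladr", "ladrillo", "bien")

# Hypothetical helper: return the longest Lexicon entry that `word` starts with,
# or NA when no entry is a prefix of the word.
longest_prefix <- function(word, lexicon) {
  hits <- lexicon[startsWith(word, lexicon)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

best <- vapply(Corpus, longest_prefix, character(1),
               lexicon = Lexicon, USE.NAMES = FALSE)
print(best)
```

Because startsWith() compares literal text, Lexicon entries containing regex metacharacters need no escaping. On a very large Lexicon this per-word scan may be slower than a single compiled pattern.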

Psidom

I'm testing your answers with the real Lexicon. I'll report the results later. Thank you both – pch919 Sep 23 '17 at 21:03

score 0 · Answer 3 · answered Sep 27 '17 at 03:16

I tried both methods, and the one that worked was the one suggested by @Psidom. If I use the match() function, it finds a match in any part of the word, not necessarily at the beginning. For instance:

Corpus<- c('tambien')
Lexicon<- c('bien')
match(Corpus,Lexicon)

The result is 'tambien', but this is not correct.
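For reference, a minimal base R sketch of how exact, unanchored, and anchored matching treat this input. It uses nothing beyond base R, and the variable names are illustrative. As documented, match() compares whole elements, so it yields NA for this pair; a hit inside 'tambien' is what an unanchored pattern produces.

```r
Corpus  <- c("tambien")
Lexicon <- c("bien")

# Exact matching: the whole string must equal a Lexicon entry.
exact <- match(Corpus, Lexicon)

# Unanchored regex: "bien" is found anywhere inside the word.
partial <- grepl(Lexicon, Corpus)

# Anchored regex: "bien" must appear at the start of the word.
anchored <- grepl(paste0("^", Lexicon), Corpus)

print(exact)     # NA
print(partial)   # TRUE
print(anchored)  # FALSE
```

Anchoring every alternative with ^ is what keeps a root from matching in the middle of a word.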

Again, thank you both for your help!!
