I have two vectors of words.
Corpus <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')
I need to find the best possible match in Lexicon for each word in Corpus. I have tried several methods; this is one of them:
library(stringr)
# The Lexicon entries are word roots, because I use the Snowball stemming method
match <- paste(Lexicon, collapse = '|^')
test <- str_extract_all(Corpus, match, simplify = TRUE)
test
     [,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"
But the match should be:
[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"
Instead, each Corpus word is matched to the first Lexicon entry, in alphabetical order, that fits it. By the way, these vectors are only a sample of a bigger list that I have.
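My guess is that a regex alternation returns the first alternative that matches, not the longest one, which would explain the result above. Here is a minimal sketch of one thing that might work, assuming "best match" means the longest Lexicon entry that starts the Corpus word and that the entries contain only plain letters (no regex metacharacters). The names by_length, pattern and longest are only for illustration.

```r
library(stringr)

Corpus <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# Put the longest entries first, so the alternation tries them before their shorter roots
by_length <- Lexicon[order(nchar(Lexicon), decreasing = TRUE)]

# Anchor the whole group once, so every alternative has to start at the beginning of the word
pattern <- paste0('^(?:', paste(by_length, collapse = '|'), ')')

# str_extract returns one match per Corpus word (NA when nothing matches)
longest <- str_extract(Corpus, pattern)
longest
# [1] "animalada" "fe"        "fernandez" "ladrillo"
```

I have not checked whether this scales to the full list, since the pattern grows with the size of the Lexicon.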
I haven't tried regex() because I'm not sure how it works. Perhaps the solution lies in that direction.
Could you help me to solve this problem? Thank you for your help.