I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo')

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to find the best possible match in Lexicon for each word in Corpus. I have tried several methods; this is one of them.

library(stringr)

match <- paste(Lexicon, collapse = '|^') # I use a stemming method (Snowball), so the words in Lexicon are word roots

test <- str_extract_all(Corpus, match, simplify = TRUE)

test

[,1]
[1,] "animal"
[2,] "fe"
[3,] "fe"
[4,] "ladr"

But the match should be:

[1,] "animalada"
[2,] "fe"
[3,] "fernandez"
[4,] "ladrillo"

Instead, each word is matched with the first entry of my Lexicon, in alphabetical order, that fits. By the way, these vectors are just a sample of a much larger list that I have.

I didn't try regex() because I'm not sure how it works. Perhaps the solution lies in that direction.

Could you help me solve this problem? Thank you for your help.

pch919

3 Answers


You can just use the match() function.

Index <- match(Corpus, Lexicon)

Index
[1] 2 3 4 6

Lexicon[Index]
[1] "animalada" "fe"        "fernandez" "ladrillo"
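
Keep in mind that match() looks for entries that are equal to the whole word, so it only helps when every corpus word appears verbatim in Lexicon. A minimal sketch of what happens otherwise, assuming a made-up corpus word 'ladrillos' (Corpus2 and idx are names introduced only for this illustration):

```r
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# 'ladrillos' is a hypothetical word that is not in Lexicon verbatim
Corpus2 <- c('fe', 'ladrillos')

# match() returns the position of an identical entry, or NA if there is none
idx <- match(Corpus2, Lexicon)
idx

# Indexing with NA gives NA, so unmatched words are easy to spot
Lexicon[idx]
```

Here the second word gets NA even though 'ladr' and 'ladrillo' are both prefixes of it, so stem-style lookups need a different approach.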
Santosh M.

You can order Lexicon by the number of characters the patterns have, in decreasing order, so the best match comes first:

match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^')

test<- str_extract_all(Corpus, match, simplify= T)

test
#     [,1]       
#[1,] "animalada"
#[2,] "fe"       
#[3,] "fernandez"
#[4,] "ladrillo" 
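
This works because the regex engine tries the alternatives from left to right and keeps the first one that matches, so longer entries have to come before their own prefixes. One detail to watch: with collapse = '|^' the first alternative gets no '^', so it could in principle match in the middle of a word. A possible variant, sketched under the assumption that the Lexicon entries contain no regex metacharacters (sorted, anchored and best are names made up for this example):

```r
library(stringr)

Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# Longest entries first, so they are tried before their own prefixes
sorted <- Lexicon[order(-nchar(Lexicon))]

# A single '^' in front of a non-capturing group anchors every alternative,
# including the first one (assumes no special regex characters in Lexicon)
anchored <- paste0('^(?:', paste(sorted, collapse = '|'), ')')

# str_extract() returns one match per word, or NA when nothing fits
best <- str_extract(Corpus, anchored)
best
```

For the sample vectors this should give the same four words as the output above; str_extract() is used instead of str_extract_all() only because an anchored pattern can match at most once per word.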
Psidom
  • I'm testing your answers with the real Lexicon. I'll report the results later. Thank you both. – pch919 Sep 23 '17 at 21:03

I tried both methods, and the right one was the one suggested by @Psidom. If I use the function match(), it finds the match in any part of the word, not necessarily at the beginning. For instance:

Corpus<- c('tambien')
Lexicon<- c('bien')
match(Corpus,Lexicon)

The result is 'tambien', but this is not correct.
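
For what it's worth, base R's match() compares whole strings, so the call above should return NA rather than 'tambien'; a hit in the middle of a word is what an unanchored regular expression produces. A minimal sketch of the three behaviours, using base R only (the names exact, unanchored and anchored are introduced just for this illustration):

```r
Corpus  <- c('tambien')
Lexicon <- c('bien')

# Whole-string comparison: 'tambien' is not equal to 'bien'
exact <- match(Corpus, Lexicon)

# Unanchored regex: 'bien' may be found anywhere inside the word
unanchored <- grepl('bien', Corpus)

# Anchored regex: 'bien' must appear at the start of the word
anchored <- grepl('^bien', Corpus)

exact
unanchored
anchored
```

Only the unanchored pattern reports a match here, which is why the '^' anchor matters when the Lexicon holds word roots.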

Again, thank you both for your help!!

pch919