I am using tokens_lookup() from quanteda to check whether some texts contain the words in my dictionary. Now I am looking for a way to discard matches where the dictionary word occurs as part of a fixed, ordered sequence of words. For example, suppose that Ireland is in the dictionary. I would like to exclude the cases where Northern Ireland is mentioned (or any other fixed sequence of words that contains Ireland). The only indirect solution I have found is to build a second dictionary containing these sequences (e.g. Northern Ireland). However, this does not work when both Ireland and Northern Ireland are mentioned in the same text. Thank you.
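library("quanteda")

# Dictionary with a single key, IE, matching the word "Ireland"
dict <- dictionary(list(IE = "Ireland"))
txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)

# This also matches the "Ireland" inside "Northern Ireland", which I want to avoid
tokens_lookup(toks, dictionary = dict)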
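
For reference, the indirect workaround I described might be sketched roughly as follows. The second dictionary dict_excl, its key NI, and the variables has_ie, has_ni and keep are names made up for this illustration. It is only a sketch of the idea, not a proposed solution.

```r
library("quanteda")

txt <- c(
  doc1 = "Ireland lorem ipsum",
  doc2 = "Lorem ipsum Northern Ireland",
  doc3 = "Ireland lorem ipsum Northern Ireland"
)
toks <- tokens(txt)

dict <- dictionary(list(IE = "Ireland"))
# Hypothetical second dictionary holding the multi-word sequences to exclude
dict_excl <- dictionary(list(NI = "Northern Ireland"))

# Flag documents that have at least one match in each dictionary
has_ie <- ntoken(tokens_lookup(toks, dictionary = dict)) > 0
has_ni <- ntoken(tokens_lookup(toks, dictionary = dict_excl)) > 0

# Keep documents that mention Ireland but never Northern Ireland
keep <- has_ie & !has_ni
keep
```

With this approach doc3 is discarded together with doc2, even though doc3 also contains a standalone Ireland, which is exactly the case I cannot handle.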