In my corpus of news articles, I would like to convert several different n-grams that refer to the same political party into a single acronym. The goal is to keep sentiment dictionaries from confusing the words in a party's name (Liberal Party) with the same word used in a different context (liberal helping).
I can do this below with `str_replace_all`, and I know about the `tokens_compound()` function in quanteda, but it doesn't seem to do exactly what I need.
```r
library(stringr)

text <- c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')

# Replace each party's name variants with its acronym, one party at a time
text1 <- str_replace_all(text, '(liberal party)|liberals', 'olp')
text2 <- str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')
```
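The same preprocessing idea can be sketched as a single call by passing a named character vector to `str_replace_all`, where each name is a regex pattern and each value is its replacement. This is only a minimal sketch: the `party_patterns` and `text_clean` names are made up for illustration, and the `\\b` word boundaries are an added assumption meant to avoid matching inside longer words.

```r
library(stringr)

# Hypothetical lookup table: regex pattern (name) -> acronym (value).
# Longer variants are listed first within each alternation.
party_patterns <- c(
  "\\b(new democratic party|new democrats)\\b" = "ndp",
  "\\b(liberal party|liberals)\\b" = "olp"
)

text <- "a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals"

# The patterns are applied in order to the whole string
text_clean <- str_replace_all(text, party_patterns)
text_clean
```

Because the party names are replaced before any tokenization, a later sentiment lookup would only see standalone uses of `liberal` or `democratic`.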
Should I somehow just preprocess the text before turning it into a corpus? Or is there a way to do this after turning it into a corpus in quanteda?
Here is some expanded sample code that specifies the problem a little better:
```r
library(quanteda)

text <- c('a text about some political parties called the new democratic party
the new democrats and the liberal party and the liberals. I would like the
word democratic to be counted in the dfm but not the words new democratic.
The same goes for liberal helpings but not liberal party')

# Party names and sentiment words share terms ("liberal", "democratic")
partydict <- dictionary(list(
  olp = c("liberal party", "liberals"),
  ndp = c("new democrats", "new democratic party"),
  sentiment = c('liberal', 'democratic')
))

dfm(text, dictionary = partydict)
```
This example counts `democratic` in both the `new democratic` and the standalone `democratic` sense, but I would like those counted separately.