substituting several ngrams in quanteda

Question

In my text of news articles I would like to convert several different ngrams that refer to the same political party to an acronym. I would like to do this because I would like to avoid any sentiment dictionaries confusing the words in the party's name (Liberal Party) with the same word in different contexts (liberal helping).

I can do this below with str_replace_all, and I know about the tokens_compound() function in quanteda, but it doesn't seem to do exactly what I need.

library(stringr)
text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')

Should I somehow just preprocess the text before turning it into a corpus? Or is there a way to do this after turning it into a corpus in quanteda?

Here is some expanded sample code that specifies the problem a little better:

text<-c('a text about some political parties called the new democratic party 
the new democrats and the liberal party and the liberals. I would like the 
word democratic to be counted in the dfm but not the words new democratic. 
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
olp = c("liberal party", "liberals"),
ndp = c("new democrats", "new democratic party"),
sentiment=c('liberal', 'democratic')
))

dfm(text, dictionary=partydict)

This example counts democratic in both the new democratic and the democratic sense, but I would like those counted separately.

Ken Benoit · Accepted Answer · 2018-10-05T19:01:14.430

You want the function tokens_lookup(), after having defined a dictionary that defines the canonical party labels as keys, and lists all the ngram variations of the party names as values. By setting exclusive = FALSE it will keep the tokens that are not matched, in effect acting as a substitution of all variations with the canonical party names.

In the example below, I've modified your input text a bit to illustrate how the party names are combined and kept distinct from phrases that use "liberal" but not "liberal party".

library("quanteda")

text<-c('a text about some political parties called the new democratic party 
         which is conservative the new democrats and the liberal party and the 
         liberals which are liberal helping poor people')
toks <- tokens(text)

partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))

(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
##  [1] "a"            "text"         "about"        "some"         "political"    "parties"     
##  [7] "called"       "the"          "NDP"          "which"        "is"           "conservative"
## [13] "the"          "NDP"          "and"          "the"          "OLP"          "and"         
## [19] "OLP"          "which"        "are"          "liberal"      "helping"      "poor"        
## [25] "people"

So that has replaced the party name variants with the party keys. Constructing a dfm from these new tokens preserves the uses of (e.g.) "liberal" that might be linked to sentiment, because "liberal party" has already been combined and replaced with "OLP". Applying a dictionary to the dfm will now work for your example of "liberal" in "liberal helping" without confusing it with the use of "liberal" in the party name.

sentdict <- dictionary(list(
    left = c("liberal", "left"),
    right = "conservative"
))

dfm(toks2) %>%
    dfm_lookup(dictionary = sentdict, exclusive = FALSE)
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
##        features
## docs    olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
##  text1   2   2 1    1     1    1         1       1      1   3     2  1     1   2    1   1       1
##        features
## docs    poor people
##  text1    1      1

Two additional notes:

If you do not want the keys uppercased in the replacement tokens, set capkeys = FALSE.
You can set different matching types using the valuetype argument, including valuetype = "regex". (A note on the regular expressions in your example: | has the lowest precedence, so '(new democrats)|new democratic party' matches either whole phrase and the parentheses are redundant. With tokens_lookup() you won't need to worry about operator scope at all!)
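The first note can be sketched as follows. This is a minimal illustration, not part of the original answer: the shortened input sentence and the name toks_lower are made up for the example, and it assumes a quanteda version in which tokens_lookup() accepts the capkeys argument as described above.

```r
library("quanteda")

# Hypothetical shortened input, for illustration only
toks <- tokens("the new democrats and the liberal party met the liberals")

partydict <- dictionary(list(
    olp = c("liberal party", "liberals"),
    ndp = c("new democrats", "new democratic party")
))

# exclusive = FALSE keeps unmatched tokens;
# capkeys = FALSE leaves the keys as written instead of uppercasing them
toks_lower <- tokens_lookup(toks, partydict, exclusive = FALSE, capkeys = FALSE)

# Inspect the replaced tokens for the single document
as.list(toks_lower)[[1]]
```

Since dfm() lowercases features by default, the choice mainly matters if you inspect or further process the tokens before building the dfm.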

Hi Ken, but this won't change the underlying text, will it? So if I construct a DFM later with a sentiment dictionary that contains "liberal", then "liberal" in "liberal party" will still appear, won't it? — spindoctor, Oct 05 '18 at 14:49
It won't change the underlying text, but it will change the tokens (as in the example). So if you send the results of the `tokens_lookup()` call to `dfm()`, you will only have "olp", not "liberal", wherever it was part of the sequence "liberal party", since the sequence of tokens `"liberal", "party"` was replaced by `"OLP"`. — Ken Benoit, Oct 05 '18 at 14:52
Sorry, I am still struggling with this. How would this relate then to constructing the dfm with a sentiment dictionary? — spindoctor, Oct 05 '18 at 15:22
