Identify WHICH words in a document have been matched by dictionary lookup and how many times

Question

Quanteda question.

For each document in a corpus, I am trying to find out which of the words in a dictionary category contribute to the overall counts for that category, and how much.

Put differently, I want to get a matrix of the features in each dictionary category that have been matched using the tokens_lookup and dfm_lookup functions, and their frequency per document. So not the aggregated frequency of all words in the category, but of each of them separately.

Is there an easy way to get this?

score 1 · Accepted Answer · answered Jul 20 '20 at 17:57

The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").

I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.

The loop iterates over the dictionary keys to build up a list, each time doing the following:

select the tokens but leave a pad for ones not selected;
compound the multi-word tokens into single tokens;
rename the pad ("") to OTHER, so that we can count non-matches; and
create the dfm.

library("quanteda")
## Package version: 2.1.0

toks <- tokens(tail(data_corpus_inaugural, 3))

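dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  # keep only this key's matches, leaving a pad ("") for everything else
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    # join multi-word matches such as "not fail" into "not_fail"
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    # rename the pads so that non-matches are counted as OTHER
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  # assign by key name so each list element is one complete, named dfm
  dfm_list[[key]] <- this_dfm
}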

Now we have all of the dictionary matches for each key in a list of dfm objects:

dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
##             features
## docs         clouds raging storms crisis war against violence hatred badly
##   2009-Obama      1      1      2      4   2       1        1      1     1
##   2013-Obama      0      1      1      1   3       1        0      0     0
##   2017-Trump      0      0      0      0   0       1        0      0     0
##             features
## docs         weakened
##   2009-Obama        1
##   2013-Obama        0
##   2017-Trump        0
## [ reached max_nfeat ... 170 more features ]
## 
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
##             features
## docs         grateful trust mindful thank well generosity cooperation
##   2009-Obama        1     2       1     1    2          1           2
##   2013-Obama        0     0       0     0    4          0           0
##   2017-Trump        1     0       0     1    0          0           0
##             features
## docs         prosperity peace skill
##   2009-Obama          3     4     1
##   2013-Obama          1     3     1
##   2017-Trump          1     0     0
## [ reached max_nfeat ... 246 more features ]
## 
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
##             features
## docs         not_apologize OTHER
##   2009-Obama             1  2687
##   2013-Obama             0  2317
##   2017-Trump             0  1660
## 
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
##             features
## docs         not_fight not_sap not_grudgingly not_fail OTHER
##   2009-Obama         0       0              1        0  2687
##   2013-Obama         1       1              0        0  2313
##   2017-Trump         0       0              0        1  1658
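
If a single table is more convenient than a list of dfm objects, the list can be stacked into one long data frame with one row per document, key and matched feature. The following is only a minimal sketch: dfm_list_to_long() is a hypothetical helper (not a quanteda function), and the toy tokens and dictionary are made up for illustration. It assumes a named list of dfm objects like dfm_list above, and it converts each dfm to a dense matrix, which is fine for small inputs but may be slow for large ones.

```r
library("quanteda")

# Hypothetical helper (not part of quanteda): stack a named list of dfm
# objects into one long data frame of the non-zero match counts
dfm_list_to_long <- function(dfm_list) {
  rows <- lapply(names(dfm_list), function(key) {
    m <- as.matrix(dfm_list[[key]])  # dense documents x features matrix
    data.frame(
      doc     = rep(rownames(m), times = ncol(m)),
      key     = rep(key, length(m)),
      feature = rep(colnames(m), each = nrow(m)),
      count   = as.vector(m),        # column-major, matching the rep() calls
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  # drop zero counts and the OTHER feature that stands for non-matches
  out[out$count > 0 & out$feature != "OTHER", ]
}

# Toy data, made up for illustration (single-word values only)
toks_toy <- tokens(c(d1 = "good good bad", d2 = "bad awful fine"))
dict_toy <- dictionary(list(pos = c("good", "fine"), neg = c("bad", "awful")))

toy_list <- lapply(names(dict_toy), function(key) {
  dfm(tokens_select(toks_toy, dict_toy[key]))
})
names(toy_list) <- names(dict_toy)

matches <- dfm_list_to_long(toy_list)
print(matches)
```

Calling dfm_list_to_long(dfm_list) on the list built above should work the same way, with the compounded features such as not_fail appearing under their keys.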
