The easiest way to do this is to iterate over your dictionary "keys" (what you call "categories") and select the matches to create one dfm per key. There are a few steps needed to deal with the non-matches and the compound dictionary values (such as "not fail").
I can demonstrate this using the built-in inaugural address corpus and the LSD2015 dictionary, which has four keys and includes multi-word values.
The loop iterates over the dictionary keys to build up a list, each time doing the following:
- select the tokens but leave a pad for ones not selected;
- compound the multi-word tokens into single tokens;
- rename the pad ("") to OTHER, so that we can count non-matches; and
- create the dfm.
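To make these steps concrete, here is a minimal sketch that applies them to a single made-up sentence with a one-key toy dictionary. The sentence, the `toy_dict` object, and the intermediate variable names are illustrative assumptions, not part of the original data. The quanteda calls are the same ones used in the loop, and the behaviour described in the comments assumes the same quanteda version as the rest of this answer (2.1.0).

```r
library("quanteda")

# Hypothetical one-sentence example and one-key dictionary (illustration only)
toy_toks <- tokens("We will not fail")
toy_dict <- dictionary(list(neg_negative = "not fail"))

# 1. keep only the dictionary matches; pad = TRUE leaves "" where other tokens were
selected <- tokens_select(toy_toks, toy_dict["neg_negative"], pad = TRUE)

# 2. join the multi-word match into a single token ("not_fail")
compounded <- tokens_compound(selected, toy_dict["neg_negative"])

# 3. rename the pads so that non-matches are counted as OTHER
renamed <- tokens_replace(compounded, "", "OTHER")

# 4. build the dfm, keeping case so that OTHER is not lowercased
toy_dfm <- dfm(renamed, tolower = FALSE)
toy_dfm
```

The loop below applies the same four steps once per dictionary key: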
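library("quanteda")
## Package version: 2.1.0

# use the three most recent inaugural addresses
toks <- tokens(tail(data_corpus_inaugural, 3))

dfm_list <- list()
for (key in names(data_dictionary_LSD2015)) {
  # select matches for this key, padding the non-matches with ""
  this_dfm <- tokens_select(toks, data_dictionary_LSD2015[key], pad = TRUE) %>%
    # join multi-word matches (e.g. "not fail") into single tokens
    tokens_compound(data_dictionary_LSD2015[key]) %>%
    # rename the pads so that non-matches are counted as OTHER
    tokens_replace("", "OTHER") %>%
    dfm(tolower = FALSE)
  dfm_list <- c(dfm_list, this_dfm)
}
names(dfm_list) <- names(data_dictionary_LSD2015)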
Now we have all of the dictionary matches for each key in a list of dfm objects:
dfm_list
## $negative
## Document-feature matrix of: 3 documents, 180 features (60.0% sparse) and 4 docvars.
##             features
## docs         clouds raging storms crisis war against violence hatred badly
##   2009-Obama      1      1      2      4   2       1        1      1     1
##   2013-Obama      0      1      1      1   3       1        0      0     0
##   2017-Trump      0      0      0      0   0       1        0      0     0
##             features
## docs         weakened
##   2009-Obama        1
##   2013-Obama        0
##   2017-Trump        0
## [ reached max_nfeat ... 170 more features ]
##
## $positive
## Document-feature matrix of: 3 documents, 256 features (53.0% sparse) and 4 docvars.
##             features
## docs         grateful trust mindful thank well generosity cooperation
##   2009-Obama        1     2       1     1    2          1           2
##   2013-Obama        0     0       0     0    4          0           0
##   2017-Trump        1     0       0     1    0          0           0
##             features
## docs         prosperity peace skill
##   2009-Obama          3     4     1
##   2013-Obama          1     3     1
##   2017-Trump          1     0     0
## [ reached max_nfeat ... 246 more features ]
##
## $neg_positive
## Document-feature matrix of: 3 documents, 2 features (33.3% sparse) and 4 docvars.
##             features
## docs         not_apologize OTHER
##   2009-Obama             1  2687
##   2013-Obama             0  2317
##   2017-Trump             0  1660
##
## $neg_negative
## Document-feature matrix of: 3 documents, 5 features (53.3% sparse) and 4 docvars.
##             features
## docs         not_fight not_sap not_grudgingly not_fail OTHER
##   2009-Obama         0       0              1        0  2687
##   2013-Obama         1       1              0        0  2313
##   2017-Trump         0       0              0        1  1658
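
If you then want a single summary, for example the number of matches per document for each key, one possible follow-up is sketched here. The helper name `dfm_by_key()` and the `match_counts` object are hypothetical. The sketch rebuilds the same list with `sapply(..., simplify = FALSE)` rather than a `for` loop so that it runs on its own, and it assumes that every dfm contains the OTHER feature and that quanteda 2.1.0 behaviour applies.

```r
library("quanteda")

# Hypothetical helper: one dfm per dictionary key, non-matches counted as OTHER
dfm_by_key <- function(toks, dict) {
  sapply(names(dict), function(key) {
    sel <- tokens_select(toks, dict[key], pad = TRUE)
    sel <- tokens_compound(sel, dict[key])
    sel <- tokens_replace(sel, "", "OTHER")
    dfm(sel, tolower = FALSE)
  }, simplify = FALSE)  # keeps a list, named by dictionary key
}

toks <- tokens(tail(data_corpus_inaugural, 3))
dfm_list <- dfm_by_key(toks, data_dictionary_LSD2015)

# Matches per document and key: drop OTHER, then count what remains in each row
match_counts <- sapply(dfm_list, function(x) {
  ntoken(dfm_remove(x, "OTHER", valuetype = "fixed", case_insensitive = FALSE))
})
match_counts
```

The result should be a matrix with one row per document and one column per dictionary key. Matching OTHER as a fixed, case-sensitive pattern is a deliberate choice so that an ordinary feature such as "other" would not be dropped by accident.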