
I am looking to adjust this code so that I can assign each of these modal verbs a different weight. The idea is to use something similar to the NRC lexicon, where the "numbers" 1-5 represent categories rather than actual numbers.

modals <- data_frame(word = c("must", "will", "shall", "should", "may", "can"),
                     modal = c("5", "4", "4", "3", "2", "1"))

My problem is that when I run the following code, five "may"s count the same as one "must". What I want is for each word to carry a different weight, so that when I run this analysis I can see the concentration of uses of the stronger "must" versus, say, the much weaker "can". (Here "tidy.DF" is my corpus, and "School" and "Target" are column names.)

MODAL <- tidy.DF %>%
  inner_join(modals) %>%
  count(School, Target, index = wordnumber %/% 50, modal) %>%
  spread(modal, n, fill = 0)

ggplot(MODAL, aes(index, `5`, fill = Target)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Target, ncol = 2, scales = "free_x")
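The weighting idea itself can be sketched in tidyverse style: store the weights as numbers rather than character strings, then sum the weights per group instead of counting rows. (This uses a small toy data frame as a stand-in for tidy.DF, since the real corpus isn't shown; column names School, Target, and wordnumber are taken from the question.)

```r
library(dplyr)
library(tidyr)

# Toy stand-in for tidy.DF -- the real corpus is not shown in the question
tidy.DF <- tibble::tibble(
  School = "A", Target = "x",
  word = c("must", "may", "may", "may", "may", "may"),
  wordnumber = 1:6
)

# Numeric weights instead of character categories, so they can be summed
modal_weights <- tibble::tibble(
  word   = c("must", "will", "shall", "should", "may", "can"),
  weight = c(5, 4, 4, 3, 2, 1)
)

MODAL <- tidy.DF %>%
  inner_join(modal_weights, by = "word") %>%
  mutate(index = wordnumber %/% 50) %>%
  group_by(School, Target, index, word) %>%
  summarise(score = sum(weight), .groups = "drop")
```

With this toy data, one "must" scores 5 while the five "may"s together score 10, so the stronger modal no longer drowns in raw counts.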
  • I think that what you are looking for is applying [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to your documents and then multiplying it by modal, or creating your own version of tf-idf. But without a fully reproducible example and some sort of expected output, it is difficult to help you. – phiver Feb 11 '19 at 13:22

1 Answer


Here's a suggestion for a better approach, using the quanteda package instead. The steps:

  1. Create a named vector of weights, corresponding to your "dictionary".
  2. Create a document feature matrix, selecting only the terms in the dictionary.
  3. Weight the observed counts.

# set modal values as a named numeric vector
modals <- c(5, 4, 4, 3, 2, 1)
names(modals) <- c("must", "will", "shall", "should", "may", "can")

library("quanteda", warn.conflicts = FALSE)
## Package version: 1.4.0
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

I'll use the most recent inaugural speeches as a reproducible example here.

dfmat <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  dfm() %>%
  dfm_select(pattern = names(modals))
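(In more recent quanteda releases, v3 and later, calling dfm() directly on a corpus is deprecated, so the equivalent pipeline tokenizes first. A sketch, assuming quanteda ≥ 3:)

```r
library("quanteda")

modals <- c(must = 5, will = 4, shall = 4, should = 3, may = 2, can = 1)

# quanteda v3+ style: tokenize explicitly before building the dfm
dfmat <- data_corpus_inaugural %>%
  corpus_subset(Year > 2000) %>%
  tokens() %>%
  dfm() %>%
  dfm_select(pattern = names(modals))
```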

This produces the raw counts.

dfmat
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
##             features
## docs         will must can should may shall
##   2001-Bush    23    6   6      1   0     0
##   2005-Bush    22    6   7      1   3     0
##   2009-Obama   19    8  13      0   3     3
##   2013-Obama   20   17   7      0   4     0
##   2017-Trump   40    3   1      1   0     0

Weighting this now is as simple as calling dfm_weight() to reweight the counts by the values of your weight vector. The function will automatically apply the weights using fixed matching of the vector element names to the dfm features.

dfm_weight(dfmat, weight = modals)
## Document-feature matrix of: 5 documents, 6 features (26.7% sparse).
## 5 x 6 sparse Matrix of class "dfm"
##             features
## docs         will must can should may shall
##   2001-Bush    92   30   6      3   0     0
##   2005-Bush    88   30   7      3   6     0
##   2009-Obama   76   40  13      0   6    12
##   2013-Obama   80   85   7      0   8     0
##   2017-Trump  160   15   1      3   0     0
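As a sanity check, the same column-wise weighting can be reproduced in base R on the printed counts (values copied from the dfm output above), multiplying each column by its matching weight with sweep():

```r
# Raw counts copied from the dfm printed above
counts <- rbind(
  "2001-Bush"  = c(will = 23, must = 6,  can = 6,  should = 1, may = 0, shall = 0),
  "2005-Bush"  = c(22, 6, 7, 1, 3, 0),
  "2009-Obama" = c(19, 8, 13, 0, 3, 3),
  "2013-Obama" = c(20, 17, 7, 0, 4, 0),
  "2017-Trump" = c(40, 3, 1, 1, 0, 0)
)

modals <- c(must = 5, will = 4, shall = 4, should = 3, may = 2, can = 1)

# Multiply each column by the weight whose name matches that column
weighted <- sweep(counts, 2, modals[colnames(counts)], "*")
weighted["2013-Obama", "must"]  # 85, as in the weighted dfm above
```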
Ken Benoit