The way the NRC word-emotion association lexicon was built makes it a pretty good fit for social media data as it is, so I recommend looking at the details of where it comes from before changing it for your analysis. However, if you decide that you do need to add words to such a sentiment lexicon for your purposes, the first step is to add the words to the dataset row-wise, for example with bind_rows(). Let's say that you think "darcy" is a positive word and "wickham" is a negative word.
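library(tidyverse)
library(tidytext)
# Load the NRC lexicon (recent tidytext versions fetch it through the textdata package)
nrc_lexicon <- get_sentiments("nrc")
# Append the custom words as new rows with the same two columns
custom_lexicon <- nrc_lexicon %>%
  bind_rows(tribble(~word, ~sentiment,
                    "darcy", "positive",
                    "wickham", "negative"))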
Now, when you want to implement sentiment analysis, you can treat either of these data frames in the same way. If you have text data (say, the text of Pride and Prejudice), you can first tidy it using unnest_tokens() and then implement sentiment analysis using an inner_join().
tidy_PandP <- tibble(text = janeaustenr::prideprejudice) %>%
  unnest_tokens(word, text)
tidy_PandP %>%
  inner_join(nrc_lexicon)
#> Joining, by = "word"
#> # A tibble: 29,651 x 2
#>    word       sentiment
#>    <chr>      <chr>
#>  1 pride      joy
#>  2 pride      positive
#>  3 prejudice  anger
#>  4 prejudice  negative
#>  5 truth      positive
#>  6 truth      trust
#>  7 possession anger
#>  8 possession disgust
#>  9 possession fear
#> 10 possession negative
#> # … with 29,641 more rows
tidy_PandP %>%
  inner_join(custom_lexicon)
#> Joining, by = "word"
#> # A tibble: 30,186 x 2
#>    word       sentiment
#>    <chr>      <chr>
#>  1 pride      joy
#>  2 pride      positive
#>  3 prejudice  anger
#>  4 prejudice  negative
#>  5 truth      positive
#>  6 truth      trust
#>  7 possession anger
#>  8 possession disgust
#>  9 possession fear
#> 10 possession negative
#> # … with 30,176 more rows
Created on 2019-08-03 by the reprex package (v0.3.0)
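To see what added words actually change, one option is to count matches per sentiment under each lexicon and put the counts side by side. The sketch below is only an illustration: toy_text, toy_lexicon, and custom_toy are made-up stand-ins (not the real novel or the real NRC lexicon), chosen so that it runs without downloading anything.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-line "corpus" standing in for the novel
toy_text <- tibble(text = c("Darcy felt pride",
                            "Wickham showed prejudice and pride"))

# One row per lowercased word
tidy_toy <- toy_text %>%
  unnest_tokens(word, text)

# Hypothetical miniature lexicon standing in for NRC
toy_lexicon <- tribble(~word,       ~sentiment,
                       "pride",     "positive",
                       "prejudice", "negative")

# Same lexicon with the two custom words appended
custom_toy <- toy_lexicon %>%
  bind_rows(tribble(~word,     ~sentiment,
                    "darcy",   "positive",
                    "wickham", "negative"))

# Count matched words per sentiment under each lexicon
base_counts <- tidy_toy %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment, name = "n_original")

custom_counts <- tidy_toy %>%
  inner_join(custom_toy, by = "word") %>%
  count(sentiment, name = "n_custom")

# Side-by-side comparison, one row per sentiment
comparison <- full_join(base_counts, custom_counts, by = "sentiment")
comparison
```

Passing by = "word" explicitly also silences the "Joining, by" message; the same pattern should carry over to tidy_PandP with nrc_lexicon and custom_lexicon.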
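Notice that you can implement the sentiment analysis for either lexicon (the original one or the one to which we added words) in the same way. The only difference in the results is the number of matches: 30,186 rows with the custom lexicon versus 29,651 with the original, so the added words account for 535 extra rows.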
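One thing to keep in mind when adding words row-wise: bind_rows() simply stacks rows, so a word and sentiment pair that is already in the lexicon would appear twice and be counted twice after the join. A minimal sketch of guarding against that with distinct() follows; toy_lexicon and new_words are made-up examples, not rows taken from the real NRC lexicon.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature lexicon; note it already contains "darcy"
toy_lexicon <- tribble(~word,   ~sentiment,
                       "pride", "joy",
                       "pride", "positive",
                       "darcy", "positive")

# Words we want to add; the first one duplicates an existing row
new_words <- tribble(~word,     ~sentiment,
                     "darcy",   "positive",
                     "wickham", "negative")

# Stack the rows, then keep each word/sentiment pair only once
deduped_lexicon <- toy_lexicon %>%
  bind_rows(new_words) %>%
  distinct(word, sentiment)

deduped_lexicon
```

A word can still carry several different sentiments after this step (as "pride" does here); distinct() only drops exact repeats of the same pair.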
I do want to note that the license for the NRC lexicon allows it to be used for research purposes for free, but for any commercial use, you must contact the NRC researchers and pay for a commercial license.