I'm preprocessing some text data for further analysis. I tokenized the text into single words using unnest_tokens(), but I want to keep certain commonly occurring two-word phrases, such as "United States" or "social security", as single tokens. How can I do this using tidytext?
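library(dplyr)
library(tidytext)
tidy_data <- data %>%
  unnest_tokens(word, text) %>%        # one lowercased word per row
  anti_join(stop_words, by = "word")   # drop common stop words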
dput(data[1:6, 1:6])
structure(list(race = c("US House", "US House", "US House", "US House",
"", "US House"), district = c(8L, 3L, 6L, 17L, 2L, 1L), party = c("Republican",
"Republican", "Republican", "Republican", "", "Republican"),
state = c("AZ", "AZ", "KY", "TX", "IL", "NH"), sponsor = c(4,
4, 4, 1, NA, 4), approve = structure(c(1L, 1L, 1L, 4L, NA,
1L), .Label = c("no oral statement of approval, authorization",
"beginning of the spot", "middle of the spot", "end of the spot"
), class = "factor")), row.names = c(NA, 6L), class = "data.frame")
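One workaround that might do what I'm after, as a rough sketch only: join each phrase with an underscore before tokenizing, then turn the underscore back into a space afterwards. This assumes the default word tokenizer behind unnest_tokens() keeps underscore-joined strings as a single token, which is worth checking on your installed version. The toy `data` tibble and the `phrases` and `protect` objects below are hypothetical and only there for illustration; stringr is an extra dependency not used in my original code.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Hypothetical toy data; the `text` column name matches the call above
data <- tibble(
  id = 1:2,
  text = c("The United States funds social security.",
           "Social Security matters in the United States.")
)

# Hypothetical list of phrases to keep together (lowercase)
phrases <- c("united states", "social security")

# Named vector for str_replace_all(): names are case-insensitive regex
# patterns, values are the underscore-joined replacements
protect <- setNames(
  str_replace_all(phrases, " ", "_"),
  paste0("(?i)\\b", phrases, "\\b")
)

tidy_data <- data %>%
  mutate(text = str_replace_all(text, protect)) %>%  # "united_states"
  unnest_tokens(word, text) %>%                      # assumed to keep "_" inside a token
  anti_join(stop_words, by = "word") %>%
  mutate(word = str_replace_all(word, "_", " "))     # back to "united states"
```

The phrase list has to be built by hand here. An alternative would be to tokenize with `token = "ngrams", n = 2` and count bigrams first to decide which phrases are worth protecting.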