Using unnest_tokens, I want to create a tidy text tibble that combines two different kinds of tokens: single words and bigrams.
The reasoning behind this is that sometimes single words are the more reasonable unit to study, and sometimes higher-order n-grams are.
If two words show up as a "sensible" bigram, I want to store the bigram and not store the individual words. If the same words show up in a different context (i.e. not as part of that bigram), then I want to save them as single words.
In the toy example below, "of the" is an important bigram. Thus, I want to remove the single words "of" and "the" wherever they actually appear as "of the" in the text. But if "of" and "the" show up in other combinations, I would like to keep them as single words.
library(janeaustenr)
library(data.table)
library(dplyr)
library(tidytext)
library(tidyr)
# make unigrams
tide <- unnest_tokens(austen_books(), output = word, input = text)
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)
# keep only the most frequent bigrams (in reality, use a more sensible metric)
keepbigram <- names(sort(table(tide2$bigrams), decreasing = TRUE)[1:10])
keepbigram
tide2 <- tide2[tide2$bigrams %in% keepbigram, ]
# split the kept bigrams into their component words
biwords <- unlist(strsplit(keepbigram, " "))
biwords
# problem: this removes every occurrence of those words, including occurrences outside the kept bigrams
tide[!(tide$word %in% biwords), ]
# goal: keep the biwords in tide wherever they are not part of a kept bigram
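
To make the desired result concrete, here is a minimal sketch of the behaviour I am after, shown on a toy vector of words rather than on the Austen data. The helper merge_bigrams is a hypothetical name, not a tidytext function, and the greedy left-to-right matching is my own assumption about how overlapping bigrams should be resolved.

```r
# Hypothetical helper (not part of tidytext): scan a vector of words from
# left to right and merge each adjacent pair listed in `keep` into one token.
merge_bigrams <- function(words, keep) {
  tokens <- character(0)
  i <- 1
  while (i <= length(words)) {
    # The last word has no right-hand neighbour, so it cannot start a bigram
    pair <- if (i < length(words)) paste(words[i], words[i + 1]) else NA_character_
    if (!is.na(pair) && pair %in% keep) {
      tokens <- c(tokens, pair)      # store the bigram and skip both of its words
      i <- i + 2
    } else {
      tokens <- c(tokens, words[i])  # word is outside any kept bigram: keep it
      i <- i + 1
    }
  }
  tokens
}

# Toy input standing in for the words of one piece of text
words <- c("the", "end", "of", "the", "road", "of", "kings")
merge_bigrams(words, keep = "of the")
# [1] "the"    "end"    "of the" "road"   "of"     "kings"
```

Here "of" and "the" disappear as single words only where they occur together as "of the"; the first "the" and the second "of" survive. The sketch only sees the words it is given, so a bigram spanning two separately processed pieces of text would be missed. What I am looking for is a way to get this result for the whole tibble, ideally in tidy style.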