R tidytext Remove word if part of relevant bigrams, but keep if not

Question

Using unnest_tokens, I want to create a tidy text tibble that combines two kinds of tokens: single words and bigrams. The reasoning is that sometimes single words are the more sensible unit to study, and sometimes higher-order n-grams are.

If two words show up as a "sensible" bigram, I want to store the bigram and not the individual words. If the same words show up in a different context (i.e. not as that bigram), I want to keep them as single words.

In the stupid example below "of the" is an important bigram. Thus, I want to remove single words "of" and "the" if they actually appear as "of the" in the text. But if "of" and "the" show up in other combinations, I would like to keep them as single words.

library(janeaustenr)
library(data.table)
library(dplyr)
library(tidytext)
library(tidyr)


# make unigrams
tide <- unnest_tokens(austen_books() , output = word, input = text )
# make bigrams
tide2 <- unnest_tokens(austen_books(), output = bigrams, input = text, token = "ngrams", n = 2)

# keep only most frequent bigrams (in reality use more sensible metric)
keepbigram <- names( sort( table(tide2$bigrams), decreasing = TRUE)[1:10]  )
keepbigram
tide2 <- tide2[tide2$bigrams %in% keepbigram,]

# this removes all unigrams which show up in relevant bigrams
biwords <- unlist( strsplit( keepbigram, " ") )
biwords
tide[!(tide$word %in% biwords),]

# want to keep biwords in tide if they are not part of bigrams

score 3 · Accepted Answer · answered Mar 17 '20 at 13:17

You could do this by replacing the bigrams you're interested in with a compound in the text before tokenisation (i.e. before unnest_tokens):

keepbigram_new <- stringi::stri_replace_all_regex(keepbigram, "\\s+", "_")
keepbigram_new
#>  [1] "of_the"   "to_be"    "in_the"   "it_was"   "i_am"     "she_had" 
#>  [7] "of_her"   "to_the"   "she_was"  "had_been"

Using _ instead of whitespace is common practice for this. stringi::stri_replace_all_regex is pretty much the same as gsub or str_replace_all from stringr, but a little faster and with more features.
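To illustrate the equivalence mentioned above, here is a minimal sketch on two made-up bigrams (toy_bigrams is a hypothetical name, not part of the original code). It only shows that base R's gsub and the stringi call give the same compounds for this simple pattern:

```r
# Toy bigrams, made up for illustration (not taken from the Austen data)
toy_bigrams <- c("of the", "to be")

# Base R: gsub() replaces every run of whitespace with an underscore
compounds_base <- gsub("\\s+", "_", toy_bigrams)

# stringi: same pattern, same replacement
compounds_stringi <- stringi::stri_replace_all_regex(toy_bigrams, "\\s+", "_")

compounds_base
compounds_stringi
```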

Now replace the bigrams in text with these new compounds before tokenisation. I use word boundary regular expressions (\\b) at the beginning and end of the bigrams to not accidentally capture e.g., "of them":

topwords <- austen_books() %>% 
  mutate(text = stringi::stri_replace_all_regex(text, paste0("\\b", keepbigram, "\\b"), keepbigram_new, vectorize_all = FALSE)) %>% 
  unnest_tokens(output = word, input = text) %>% 
  count(word, sort = TRUE) %>% 
  mutate(rank = seq_along(word))
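To see what the replacement step does in isolation, here is a small sketch on made-up sentences. The names toy_text, toy_bigrams and toy_compounds are hypothetical, and strsplit stands in for unnest_tokens only to keep the example free of extra packages:

```r
library(stringi)

# Made-up sentences: one contains "of the", the other only "of them"
toy_text <- c("to be at the end of the road",
              "one of them left the house")

toy_bigrams   <- c("of the", "to be")
toy_compounds <- stri_replace_all_regex(toy_bigrams, "\\s+", "_")

# vectorize_all = FALSE applies every pattern to every string in turn;
# the \\b anchors stop "of the" from matching inside "of them"
replaced <- stri_replace_all_regex(
  toy_text,
  paste0("\\b", toy_bigrams, "\\b"),
  toy_compounds,
  vectorize_all = FALSE
)
replaced

# Crude whitespace tokenisation as a stand-in for unnest_tokens()
tokens <- unlist(strsplit(replaced, " "))
tokens
```

In the first sentence "of the" and "to be" become single tokens while the other "the" stays a unigram; in the second sentence "of" and "the" both survive as single words, which is the behaviour the question asks for.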

Looking at the most common words, the first bigram now appears at rank 40:

topwords %>% 
  slice(1:4, 39:41)
#> # A tibble: 7 x 3
#>   word       n  rank
#>   <chr>  <int> <int>
#> 1 and    22515     1
#> 2 to     20152     2
#> 3 the    20072     3
#> 4 of     16984     4
#> 5 they    2983    39
#> 6 of_the  2833    40
#> 7 from    2795    41
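One caveat, offered as an observation about the code above rather than something the answer states: keepbigram comes from unnest_tokens, which lowercases by default, while the raw text is mixed-case and the regex match is case-sensitive by default. A bigram such as "i am", or "Of the" at the start of a sentence, would therefore not be replaced. A possible workaround, sketched on a made-up sentence with hypothetical toy_* names, is to pass stri_opts_regex(case_insensitive = TRUE):

```r
library(stringi)

# Made-up sentence with capitalised occurrences of both bigrams
toy_text      <- "I am sure. Of the two, I am taller."
toy_bigrams   <- c("i am", "of the")
toy_compounds <- c("i_am", "of_the")
toy_patterns  <- paste0("\\b", toy_bigrams, "\\b")

# Default: case-sensitive, so nothing matches here
case_sensitive <- stri_replace_all_regex(
  toy_text, toy_patterns, toy_compounds, vectorize_all = FALSE
)

# Case-insensitive matching; the replacement is the lowercase compound,
# which is harmless if the text is lowercased during tokenisation anyway
case_insensitive <- stri_replace_all_regex(
  toy_text, toy_patterns, toy_compounds,
  vectorize_all = FALSE,
  opts_regex = stri_opts_regex(case_insensitive = TRUE)
)

case_sensitive
case_insensitive
```

Whether this changes the counts shown above depends on how often the bigrams occur capitalised in the corpus, so it is worth checking on the real data.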
