I am currently working with a large number of judicial documents. They contain a number of fixed phrases (e.g. "Council directive") which, due to their frequent occurrence, carry no meaning for my analysis, so I would like to remove them. A custom list of stop words would not work, because the individual words do carry meaning in other contexts.
So far, I have used the tidytext package. My initial idea was to convert the text into bigrams and remove the unwanted ones with dplyr::anti_join().
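However, this does not get rid of the phrase entirely, because bigrams overlap. For example, "according to Council directive 453-EL [...]" yields, among others, the bigrams "to Council", "Council directive", and "directive 453", so dropping "Council directive" still leaves "Council" and "directive" behind in the neighbouring bigrams.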
Does anybody have a neat way to solve this problem? Ideally, I would like to avoid converting my text to bigrams in the first place. Here is the code for a reproducible example:
library(dplyr)
library(tidytext)
# One-row data frame holding the example sentence
text <- data.frame(word = "according to Council directive 453-EL")
# Split the sentence into overlapping bigrams
txt_bigrams <- text %>% unnest_tokens(ngram, word, token = "ngrams", n = 2)
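To make the problem concrete, here is a minimal sketch of the anti_join() attempt described above. The lookup table `phrases` and the result name `remaining` are hypothetical names chosen for illustration. The phrase is written in lower case on the assumption that unnest_tokens() is left at its default of lowercasing tokens.

```r
library(dplyr)
library(tidytext)

# Same example sentence as in the reproducible example
text <- data.frame(word = "according to Council directive 453-EL")
txt_bigrams <- text %>% unnest_tokens(ngram, word, token = "ngrams", n = 2)

# Hypothetical lookup table of phrases to drop; lower case because
# unnest_tokens() lowercases tokens by default
phrases <- data.frame(ngram = "council directive")

# Remove only the bigrams that match a phrase exactly
remaining <- txt_bigrams %>% anti_join(phrases, by = "ngram")
remaining$ngram
```

The exact bigram is removed, but the neighbouring bigrams that overlap with it (such as "to council") are still in `remaining`, which is the behaviour I would like to avoid.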
Thank you!