
I am currently working on a large number of judicial documents. They contain a number of fixed phrases (e.g. "Council directive") which, due to their frequent occurrence, carry no meaning for my analysis, so I would like to remove them. A personalised list of stop words would not work, because the individual words do bear meaning in other contexts.

So far, I have used the tidytext package. My initial idea was to convert the text into bigrams and use dplyr::anti_join(). However, this does not get rid of the phrase entirely. For example, "according to Council directive 453-EL [...]" becomes the bigrams "according to", "to council", "council directive", "directive 453", and "453 el" (unnest_tokens() lowercases by default), so removing "council directive" still leaves the fragments "to council" and "directive 453"; a sketch of this follows the example code below.

Does anybody have a neat way to solve this problem? Ideally, I would like to avoid converting my text to bigrams in the first place. Here is the code for a reproducible example:

library(dplyr)
library(tidytext)

text <- "according to Council directive 453-EL" %>% data.frame()
colnames(text) <- c("word")

txt_bigrams <- text %>% unnest_tokens(ngram, word, token = "ngrams", n = 2)
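
To make the failure mode concrete, here is a minimal sketch of the anti_join() idea (the stop_bigrams table is an illustrative name, not part of the original code):

stop_bigrams <- data.frame(ngram = "council directive")  # tokenized the same way as the text

txt_bigrams %>% anti_join(stop_bigrams, by = "ngram")
# leaves "according to", "to council", "directive 453", and "453 el":
# the fragments "to council" and "directive 453" survive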

Thank you!

banannanas
  • This is a common text-cleaning step in text mining. I would suggest you create a list of meaningless phrases and remove them either from the raw text to be mined, or from the data set of n-grams you get after `unnest_tokens()` – Nicolás Velasquez Mar 13 '23 at 18:50
    Why not create a list of stop phrases and filter them out? `dplyr::filter(!(phrases %in% stop_phrases))` – AcademicDialysis Mar 13 '23 at 18:50
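
A minimal sketch of the commenters' suggestion to strip the phrases from the raw text before tokenizing, using base R's gsub() (stop_phrases is an illustrative vector; these phrases contain no regex metacharacters, so a plain pattern is safe):

stop_phrases <- c("Council directive")

cleaned <- text$word
for (p in stop_phrases) {
  # remove each stop phrase from the raw text, ignoring case
  cleaned <- gsub(p, "", cleaned, ignore.case = TRUE)
}
cleaned
# [1] "according to  453-EL"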

1 Answer


If you use the quanteda package, you can remove a list of custom stopwords very easily. It even provides a phrase() function that can be used as a pattern so that multi-word stopwords like "Council directive" are removed as a unit.

The only thing you need to make sure of is that the stop phrases match the text. tokens_remove() matches case-insensitively by default (case_insensitive = TRUE), so "Council directive" also catches "council directive"; if you set case_insensitive = FALSE, the casing of your stop phrases has to match the text exactly.

library(dplyr)
library(quanteda)

text <- data.frame(word = "according to Council directive 453-EL")

stop_phrases <- "Council directive"

my_corp <- corpus(text$word)

# tokens() does not lowercase; use tokens_tolower() for that (dfm() lowercases by default)
my_toks <- tokens(my_corp)

# phrase() turns "Council directive" into a two-token sequence pattern,
# so the whole phrase is removed rather than the individual words
my_toks <- tokens_remove(my_toks, pattern = phrase(stop_phrases))

my_toks
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "according" "to"        "453-EL"
phiver