I am having trouble removing profanities from my n-grams. The getProfanityWords function below correctly creates a character vector. The whole script works in every other way, but the profanities remain.
I wondered whether the hyphens in the 2-grams and 3-grams were the cause, but the 1-grams are affected too.
library(quanteda)  # provides corpus(), tokens(), tokens_remove() and dfm()

getProfanityWords <- function() {
  # Download the profanity file to disk if that has not been done already
  profanityFileName <- "profanity.txt"
  if (!file.exists(profanityFileName)) {
    profanity.url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
    download.file(profanity.url, destfile = profanityFileName, method = "curl")
  }
  # If the profanity list is not in memory, load it
  if (sum(ls() == "profanity") < 1) {
    profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
    profanity <- profanity$V1
    # Drop the last entry (explicit form of 1:length(profanity)-1,
    # which parses as (1:length(profanity)) - 1)
    profanity <- profanity[-length(profanity)]
  }
  return(profanity)
}
makeSentences <- function(input) {
  # Split the input into sentence-level tokens
  output <- tokens(input, what = "sentence", remove_numbers = TRUE,
                   remove_punct = TRUE, remove_separators = TRUE,
                   remove_hyphens = TRUE,
                   remove_twitter = TRUE,
                   remove_symbols = TRUE,
                   include_docvars = FALSE)
  # Attempt to remove the profanities, then flatten to a character vector
  output <- tokens_remove(output, getProfanityWords())
  unlist(output)
}
makeNGrams <- function(text, n = 1L) {
  tokens(
    text,
    what = "word",
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_separators = TRUE,
    remove_twitter = TRUE,
    remove_symbols = TRUE,
    ngrams = n
  )
}
corpora <- corpus(textData)
sentences <- makeSentences(corpora)
ngram1 <- makeNGrams(sentences, 1)
dfm1 <- dfm(ngram1)
ngram2 <- makeNGrams(sentences, 2)
dfm2 <- dfm(ngram2)
ngram3 <- makeNGrams(sentences, 3)
dfm3 <- dfm(ngram3)
I have tried adding the following:
dfm3 <- dfm(ngram3, remove=getProfanityWords())
and something similar inside the makeNGrams function, but neither makes any difference.
What am I doing wrong?
Thanks,
Chris.