
I am having trouble removing profanities from my n-grams. The getProfanityWords function below correctly creates a character vector. The whole script works in every other way, but the profanities remain.

I did wonder whether it was something to do with the hyphens in the 2- and 3-grams, but the problem applies to the 1-grams too.

getProfanityWords <- function() {
    # Download the profanity file to disk if not done so already
    profanityFileName <- "profanity.txt"
    if (!file.exists(profanityFileName)) {
        profanity.url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
        download.file(profanity.url, destfile = profanityFileName, method = "curl")
    }

    # If the profanity list is not already in memory, load it
    if (sum(ls() == "profanity") < 1) {
        profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
        profanity <- profanity$V1
        profanity <- profanity[1:(length(profanity) - 1)]  # drop the last entry
    }
    return(profanity)
}

makeSentences <- function(input) {
    output <- tokens(input, what = "sentence", remove_numbers = TRUE,
                 remove_punct = TRUE, remove_separators = TRUE,
                 remove_hyphens = TRUE,
                 remove_twitter = TRUE,
                 remove_symbols = TRUE,
                 include_docvars = FALSE)
    output <- tokens_remove(output, getProfanityWords())
    unlist(output)
}

makeNGrams <- function(text, n = 1L) {
    tokens(
        text,
        what = "word",
        remove_numbers = TRUE,
        remove_punct = TRUE,
        remove_separators = TRUE,
        remove_twitter = TRUE,
        remove_symbols = TRUE,
        ngrams = n
    )
}

corpora <- corpus(textData)
sentences <- makeSentences(corpora)

ngram1 <- makeNGrams(sentences, 1)
dfm1 <- dfm(ngram1)
ngram2 <- makeNGrams(sentences, 2)
dfm2 <- dfm(ngram2)
ngram3 <- makeNGrams(sentences, 3)
dfm3 <- dfm(ngram3)

I have tried adding in

dfm3 <- dfm(ngram3, remove=getProfanityWords())

and something similar within the makeNGrams function, but it makes no difference.

What am I doing wrong?

Thanks,

Chris.


1 Answer


I think I have a solution for you.

tokens_remove drops whole tokens that match a pattern; it is not meant for removing words from inside a larger token. Because your makeSentences function tokenizes into sentences, each token is an entire sentence, so matching it against a list of single profanity words removes nothing.
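To illustrate the difference, here is a minimal sketch (the word "badword" is just a placeholder, not taken from the real profanity list):

library(quanteda)

txt <- "this sentence contains badword here"

# Sentence tokens: the whole sentence is a single token, so a single-word
# pattern never matches and nothing is removed.
tokens_remove(tokens(txt, what = "sentence"), "badword")

# Word tokens: "badword" is its own token, so it is dropped.
tokens_remove(tokens(txt, what = "word"), "badword")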

tokens_remove does, however, work well with a dictionary object, so the first step is to wrap the profanity list in a dictionary.

dict <- dictionary(list(bad_words = getProfanityWords()))

Next, you can wrap a tokens_remove call inside your makeNGrams function.

makeNGrams <- function(text, n = 1L) {
  out <- tokens_remove(tokens(text), dict)
  tokens(
    out,
    what = "word",
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_separators = TRUE,
    remove_twitter = TRUE,
    remove_symbols = TRUE,
    ngrams = n
  )
}

This should remove the profanity words from your text. It does in the simple example I created for myself, which I'm not posting here because of the profanity. :-)
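As a quick sanity check (just a sketch, assuming the ngram1 object from your pipeline exists), you can confirm that none of the resulting features appear in the profanity list:

# Expect FALSE: no 1-gram feature should match an entry in the profanity list
any(featnames(dfm(ngram1)) %in% getProfanityWords())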

Edit: added function

Here is the makeSentences function I used. Combined with the code above, it works as expected; I can't seem to reproduce your error.

makeSentences <- function(input) {
  output <- tokens(input, what = "sentence", remove_numbers = TRUE,
                   remove_punct = TRUE, remove_separators = TRUE,
                   remove_hyphens = TRUE,
                   remove_twitter = TRUE,
                   remove_symbols = TRUE,
                   include_docvars = FALSE)
  unlist(output)
}

# txt <- "add profane text example here"
corpora <- corpus(txt)
sentences <- makeSentences(corpora)
ngram1 <- makeNGrams(sentences, 1)
ngram2 <- makeNGrams(sentences, 2)
ngram3 <- makeNGrams(sentences, 3)
  • Thanks @phiver - looks promising. When I run it, makeNGrams returns with Error in qatd_cpp_tokens_replace(x, type, ids_pat, ids_repl) : Not compatible with requested type: [type=NULL; target=double]. Do I need to specify which item in the list (bad_words)? – Chris Aug 30 '19 at 14:34