
I have a column of titles in a table and would like to delete all words that are listed in a separate table/vector.

For example, table of titles:

"Lorem ipsum dolor"
"sit amet, consectetur adipiscing"
"elit, sed do eiusmod tempor"
"incididunt ut labore"
"et dolore magna aliqua."

To be deleted: c("Lorem", "dolore", "elit")

output:

"ipsum dolor"
"sit amet, consectetur adipiscing"
", sed do eiusmod tempor"
"incididunt ut labore"
"et magna aliqua."

The blacklisted words can occur multiple times.

The tm package has this functionality, but I have only seen it used when building a wordcloud. I need to keep the column intact rather than joining all the rows into one character string. Regex functions such as gsub() don't seem to work when given a vector of values as the pattern. An Oracle SQL solution would also be interesting.

Tomek P

4 Answers

lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

output <- lorem
for (i in to.delete) {
  output <- gsub(i, "", output)
}

This gives:

[1] " ipsum dolor"                     "sit amet, consectetur adipiscing"
[3] ", sed do eiusmod tempor"          "incididunt ut labore"            
[5] "et  magna aliqua."
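The result above still carries the spaces that surrounded each deleted word, whereas the output requested in the question has none. A minimal sketch of one way to tidy this up, assuming plain spaces are the only separators; the names titles, to_delete and cleaned are made up for illustration, and fixed = TRUE is an added choice so each word is treated as a literal string rather than a regular expression:

```r
titles <- c("Lorem ipsum dolor",
            "sit amet, consectetur adipiscing",
            "elit, sed do eiusmod tempor",
            "incididunt ut labore",
            "et dolore magna aliqua.")

to_delete <- c("Lorem", "dolore", "elit")

cleaned <- titles
for (word in to_delete) {
  # fixed = TRUE matches the word literally, so regex metacharacters are harmless
  cleaned <- gsub(word, "", cleaned, fixed = TRUE)
}

# Collapse runs of spaces left behind by the deletions,
# then strip leading and trailing whitespace
cleaned <- trimws(gsub(" +", " ", cleaned))

print(cleaned)
```

Like the loop above, this also removes the words when they occur inside longer words, so it only suits lists where that is acceptable.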
maccruiskeen
  • Thanks a lot. I was also thinking about looping gsub, but I was afraid it might not be viable performance-wise: the to.delete list has several thousand words, so that would mean several thousand executions of gsub. Could that be an issue? – Tomek P Dec 08 '15 at 15:02
  • It might be slow alright. @jeremycg's answer might run faster for you in that case. – maccruiskeen Dec 08 '15 at 15:05

First read the data:

dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

We can avoid the loop with a little smart pasting. In a regular expression, | means "or", so pasting the words together with | gives one pattern that removes all of them in a single gsub() call:

gsub(paste0(todelete, collapse = "|"), "", dat)
jeremycg
  • For future readers: this method fails for long "todelete" vectors (thousands of words in my case), so it seems that loops are sometimes unavoidable. – Tomek P Dec 28 '15 at 10:53
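A possible middle ground between one huge pattern and one gsub() call per word is to split the word list into chunks and build one alternation per chunk. This is only a sketch: remove_words_chunked and its chunk_size default of 500 are hypothetical, and the chunk size that works will depend on the regex engine and the words involved.

```r
dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

# Hypothetical helper: remove words in batches so no single pattern gets too long
remove_words_chunked <- function(x, words, chunk_size = 500) {
  # Assign each word to a chunk number: 1, 1, ..., 2, 2, ...
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    # One alternation pattern per chunk, e.g. "Lorem|dolore"
    pattern <- paste0(chunk, collapse = "|")
    x <- gsub(pattern, "", x)
  }
  x
}

# A tiny chunk size here just to exercise more than one chunk
result <- remove_words_chunked(dat, todelete, chunk_size = 2)
print(result)
```

With several thousand words and a chunk size of a few hundred, this needs a handful of gsub() calls instead of thousands.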

You could also use stri_replace_all_fixed:

library(stringi)
lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

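# vectorize_all = FALSE applies every word in to.delete to every title;
# the default (TRUE) pairs titles with words element by element, with recycling
stri_replace_all_fixed(lorem, to.delete, '', vectorize_all = FALSE)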

Output:

[1] " ipsum dolor"                     "sit amet, consectetur adipiscing" ", sed do eiusmod tempor"         
[4] "incididunt ut labore"             "et  magna aliqua."               
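One detail worth checking with this function is how the pattern vector is matched up with the input. By default (vectorize_all = TRUE) stringi pairs the i-th string with the i-th pattern, recycling as needed, while vectorize_all = FALSE applies every pattern to every string. The sketch below uses two made-up titles to show the difference; the variable names are illustrative only.

```r
library(stringi)

# Made-up titles in which each string contains both words
titles <- c("Lorem elit", "elit Lorem")
words  <- c("Lorem", "elit")

# Default: titles[1] is only checked for words[1], titles[2] only for words[2]
paired <- stri_replace_all_fixed(titles, words, "")

# vectorize_all = FALSE: every word is removed from every title
all_words <- stri_replace_all_fixed(titles, words, "", vectorize_all = FALSE)

print(paired)     # " elit"  " Lorem"
print(all_words)  # " "      " "
```

For a blacklist that should apply to the whole column, the second form is the one that matches the loop-based answers.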
LyzandeR

The tm package has a function for this: tm:::removeWords.character

It is implemented as follows:

foo <- function(x, words){
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), 
                                         collapse = "|")), "", x, perl = TRUE)
}

For the example words, where x is the vector of titles, this amounts to calling:

gsub("(*UCP)\\b(Lorem|elit|dolore)\\b","", x, perl = TRUE)
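The \\b word boundaries are what separate this from the plain alternation: only whole words are removed. A short sketch of the difference, repeating the foo definition from above so the block runs on its own; the "elite elit" string is a made-up test case:

```r
# Same definition as in the answer above, repeated so this block is self-contained
foo <- function(x, words) {
  gsub(sprintf("(*UCP)\\b(%s)\\b",
               paste(sort(words, decreasing = TRUE), collapse = "|")),
       "", x, perl = TRUE)
}

titles <- c("Lorem ipsum dolor",
            "sit amet, consectetur adipiscing",
            "elit, sed do eiusmod tempor",
            "incididunt ut labore",
            "et dolore magna aliqua.")
to_delete <- c("Lorem", "dolore", "elit")

whole_words <- foo(titles, to_delete)
print(whole_words)

# With boundaries, "elite" is left alone and only the standalone "elit" goes
with_boundary <- foo("elite elit", "elit")        # "elite "
# Without boundaries, the "elit" inside "elite" is removed as well
without_boundary <- gsub("elit", "", "elite elit")  # "e "
```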
Rentrop