how to delete words from a list in a column in R

Question

I have a column of titles in a table and would like to delete all words that are listed in a separate table/vector.

For example, table of titles:

"Lorem ipsum dolor"
"sit amet, consectetur adipiscing"
"elit, sed do eiusmod tempor"
"incididunt ut labore"
"et dolore magna aliqua."

To be deleted: c("Lorem", "dolore", "elit")

output:

"ipsum dolor"
"sit amet, consectetur adipiscing"
", sed do eiusmod tempor"
"incididunt ut labore"
"et magna aliqua."

The blacklisted words can occur multiple times.

The tm package has this functionality, but only in the context of building a wordcloud. I need to keep the column intact rather than joining all the rows into one string of characters. Regex functions such as gsub() don't seem to work when given a set of values as a pattern. An Oracle SQL solution would also be interesting.

gsub(), https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html — Berecht, Dec 08 '15 at 14:55
thanks, but as written in the question, I wasn't able to use a set of values as a pattern for a regex expression - am I missing something? — Tomek P, Dec 08 '15 at 14:58

score 3 · Answer 1 · answered Dec 08 '15 at 14:59

lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

output <- lorem
for (i in to.delete) {
  output <- gsub(i, "", output)
}

This gives:

[1] " ipsum dolor"                     "sit amet, consectetur adipiscing"
[3] ", sed do eiusmod tempor"          "incididunt ut labore"            
[5] "et  magna aliqua."
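
The loop leaves stray spaces where words were removed (for example "et  magna aliqua."), while the output requested in the question has none. Here is a minimal sketch of one way to tidy that up, assuming it is acceptable to collapse every run of whitespace in a title to a single space. The fixed = TRUE argument is an addition here, on the assumption that the words should be matched literally rather than as regular expressions:

```r
lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

output <- lorem
for (w in to.delete) {
  # fixed = TRUE (an assumption) treats each word as a literal string
  output <- gsub(w, "", output, fixed = TRUE)
}

# Collapse runs of whitespace left behind, then trim both ends
output <- trimws(gsub("\\s+", " ", output))
print(output)
```

This reproduces the five strings listed as the desired output in the question, including the leading comma in the third one.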

maccruiskeen

Thanks a lot, I was also thinking about looping the gsub, I was just afraid if that's viable performance-wise: the to.delete list has several thousand words, so that would mean several thousand executions of gsub - could that be an issue? – Tomek P Dec 08 '15 at 15:02
It might be slow alright. @jeremycg's answer might run faster for you in that case. – maccruiskeen Dec 08 '15 at 15:05

score 2 · Accepted Answer · answered Dec 08 '15 at 15:00

First read the data:

dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

We can avoid the loop with a little smart pasting. In a regex, | means "or", so joining the words with | gives a single pattern that matches any of them:

gsub(paste0(todelete, collapse = "|"), "", dat)

jeremycg

for future readers - this method fails for long "todelete" vectors (thousands of words in my case), so it seems that loops are sometimes unavoidable. – Tomek P Dec 28 '15 at 10:53
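
If the single alternation pattern becomes too long to compile, one possible middle ground is to paste the words together in chunks and loop over the chunks rather than over individual words. This is only a sketch: remove_words_chunked and chunk_size are hypothetical names, the default of 500 is an arbitrary guess rather than a measured limit, and the words are assumed to contain no regex metacharacters.

```r
# Hypothetical helper: remove words using one alternation pattern per chunk
remove_words_chunked <- function(x, words, chunk_size = 500) {
  # Group the words into consecutive chunks of at most chunk_size elements
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (chunk in chunks) {
    # Each chunk becomes a pattern such as "Lorem|dolore"
    pattern <- paste0(chunk, collapse = "|")
    x <- gsub(pattern, "", x)
  }
  x
}

dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

# chunk_size = 2 is used only to force more than one chunk on this tiny example
result <- remove_words_chunked(dat, todelete, chunk_size = 2)
print(result)
```

With several thousand words this runs gsub once per chunk instead of once per word; a workable chunk size would have to be found by experiment.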

score 2 · Answer 3 · answered Dec 08 '15 at 15:10

You could also use stri_replace_all_fixed:

library(stringi)
lorem <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")

to.delete <- c("Lorem", "dolore", "elit")

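# vectorize_all = FALSE applies every pattern in to.delete to every string;
# with the default (TRUE) the patterns are recycled element-wise against lorem
stri_replace_all_fixed(lorem, to.delete, '', vectorize_all = FALSE)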

Output:

[1] " ipsum dolor"                     "sit amet, consectetur adipiscing" ", sed do eiusmod tempor"         
[4] "incididunt ut labore"             "et  magna aliqua."

score 2 · Answer 4 · answered Dec 08 '15 at 15:12

The tm package has a function for that: tm:::removeWords.character

It is implemented as follows:

foo <- function(x, words){
  gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), 
                                         collapse = "|")), "", x, perl = TRUE)
}

For the words in the question, this amounts to the following call, where x is the character vector of titles:

gsub("(*UCP)\\b(Lorem|elit|dolore)\\b","", x, perl = TRUE)
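
Because of the \\b word boundaries, this pattern only removes whole words, unlike a plain substring match. The sketch below applies the same regex to a column of a data frame, which is the situation described in the question. The names remove_whole_words, titles, title and cleaned are hypothetical, chosen only for illustration:

```r
# Same regex construction as the function shown above, under a hypothetical name
remove_whole_words <- function(x, words) {
  pattern <- sprintf("(*UCP)\\b(%s)\\b",
                     paste(sort(words, decreasing = TRUE), collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Hypothetical table with one title per row
titles <- data.frame(
  title = c("Lorem ipsum dolor", "et dolore magna aliqua."),
  stringsAsFactors = FALSE
)

# The column keeps one entry per row; only matching whole words are dropped
titles$cleaned <- remove_whole_words(titles$title, c("Lorem", "dolor"))

# For contrast: a plain substring match also cuts into "dolore"
substring_version <- gsub("dolor", "", titles$title)

print(titles$cleaned)
print(substring_version)
```

Here "dolor" is removed from the first title but "dolore" in the second is left alone, whereas the substring version turns "dolore" into "e".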
