R count how often words from a list appear in a sentence

Question

I'm currently taking a MOOC and trying my hand at some sentiment analysis, but I'm having trouble with the R code.

What I have is a list of bad words and a list of good words. For instance, my bad words are c("dent", "broken", "wear", "cracked"), etc.

I have a list of descriptions in my data frame. What I want is a count, for each row, of how many of my bad words and how many of my good words appear in the description.

For instance, suppose this is my data frame:

desc = c("this screen is cracked", "minor dents and scratches", "100% good", "in perfect condition")
id = c(1,2,3,4)
df = data.frame(id, desc)
bad.words = c("cracked", "scratches", "dents")

What I want is a sum column that counts how many of the bad words appear in each description,

so I'm hoping my final df would look like

id    desc                        sum
1     "this screen is cracked"    1
2     "minor dents and scratches" 2
3     "100% good"                 0
4     "in perfect condition"      0

What I have so far is

df$sum <- grepl(paste( bad.words, collapse="|"), df$desc)

which only gives me TRUE or FALSE for whether any of the words appears, not a count.

sum(grepl(paste(bad.words, collapse="|"), description))? If so yeah gave that a try, but the result didn't seem right since all the columns had the same value — Wizuriel, Aug 02 '15 at 20:30
eventual goal would be to try and use regex so crack also matches cracked and or cracks — Wizuriel, Aug 02 '15 at 20:33
tried to make it a bit more clear as still not getting it to work with sapply — Wizuriel, Aug 02 '15 at 20:57
May be `colSums(sapply(df$desc, function(x) sapply(bad.words, function(y) sum(grepl(y,x)))))` or `sapply(strsplit(as.character(df$desc), ' '), function(x) sum(x %in% bad.words))` — akrun, Aug 02 '15 at 21:01

Rich Scriven · Answer 1 · 2015-08-02T21:36:16.133

3

If you are finding a sum, vapply() is more appropriate than sapply(), because it checks that every result has the declared type and length (here a single integer, 1L). You could do

library(stringi)
df$sum <- vapply(df$desc, function(x) sum(stri_count_fixed(x, bad.words)), 1L)

Which gives

df
#   id                      desc sum
# 1  1    this screen is cracked   1
# 2  2 minor dents and scratches   2
# 3  3                 100% good   0
# 4  4      in perfect condition   0
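
For the regex goal raised in the comments (having crack also match cracked or cracks), a minimal base R sketch is below. It swaps stringi for gregexpr(), and the stem list bad.stems is an assumed example, not something from the question.

```r
# Same example data as the question; strings kept as character, not factor
df <- data.frame(
  id = c(1, 2, 3, 4),
  desc = c("this screen is cracked", "minor dents and scratches",
           "100% good", "in perfect condition"),
  stringsAsFactors = FALSE
)

# Assumed example stems: each should match any word that starts with it
bad.stems <- c("crack", "scratch", "dent")

# \\b anchors at a word start, \\w* swallows the rest of the word
pattern <- paste0("\\b(", paste(bad.stems, collapse = "|"), ")\\w*")

# gregexpr() returns match positions per string, or -1 when nothing matches
matches <- gregexpr(pattern, df$desc, perl = TRUE)
df$sum <- vapply(matches, function(m) sum(m > 0), integer(1))
```

Because the pattern is anchored at a word boundary, "dent" would match "dents" or "dented" but not a word that merely contains those letters in the middle.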

edited Aug 02 '15 at 21:36

answered Aug 02 '15 at 21:30


score 1 · Answer 2 · answered Aug 02 '15 at 21:52

Since you're likely going to try different lists of words (good.words, bad.words, really.bad.words), I would write a function. I like lapply, but vapply and others will work too.

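countwords <- function(x, comparison) {
  # as.character() guards against desc being a factor, which strsplit() rejects
  counts <- lapply(as.character(x), function(x, comparewords) {
    # split on single spaces and count the tokens found in the word list
    sum(strsplit(x, ' ')[[1]] %in% comparewords)
  }, comparewords = comparison)
  unlist(counts)  # plain integer vector rather than a list column
}
# good.words is assumed to be defined the same way as bad.words
df$good <- countwords(df$desc, good.words)
df$bad <- countwords(df$desc, bad.words)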
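
Splitting on a single space only counts exact tokens, so "cracked." or "Cracked" would be missed. A rough sketch of a more forgiving tokenizer follows; count_tokens and the good.words list are hypothetical names made up for illustration, and the descriptions are altered slightly to show the effect.

```r
# Hypothetical helper: lower-case each description, split on runs of
# non-letters, and count the tokens that appear in `words`
count_tokens <- function(x, words) {
  tokens <- strsplit(tolower(as.character(x)), "[^a-z]+")
  vapply(tokens, function(tok) sum(tok %in% tolower(words)), integer(1))
}

# Descriptions with punctuation and mixed case added for the example
desc <- c("This screen is CRACKED.", "minor dents, and scratches",
          "100% good", "in perfect condition")
bad.words <- c("cracked", "scratches", "dents")
good.words <- c("good", "perfect")  # assumed example list

bad.counts <- count_tokens(desc, bad.words)
good.counts <- count_tokens(desc, good.words)
```

This treats digits and punctuation as separators, which is probably fine for short product descriptions but would need rethinking for words with apostrophes or non-ASCII letters.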

The tm package is useful as well, once you're done learning the basics and want production speed.
