I only found solutions in Python / Java for this question.

I have a data.frame with press articles and the corresponding dates. I further have a list of keywords that I want to check each article for.

df <- data.frame(c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"), 
                 c("Articel does not contain a key word", "Articel does contain the key word revenue", "Articel does contain two keywords revenue and margin","Articel does not contain the key word margin"))
colnames(df) <- c("date","article")

key.words <- c("revenue", "margin", "among others")

I came up with a solution that works if I only want to check whether at least one of the keywords appears in an article:

library(dplyr)  # needed for filter(); otherwise stats::filter is called
article.containing.keyword <- filter(df, grepl(paste(key.words, collapse="|"), article))

This works well, but what I am actually looking for is a solution where I can set a threshold along the lines of "an article must contain at least n keywords to pass the filter". For example, an article must contain at least n = 2 keywords to be selected. The desired output would look like this:

  date       article
3 2015-05-08 Articel does contain two keywords revenue and margin
constiii

1 Answer

You could use stringr::str_count, which returns the number of pattern matches in each string:

stringr::str_count(df$article, paste(key.words, collapse="|"))
[1] 0 1 2 1

This can be used inside filter like this:

article.containing.keyword <- dplyr::filter(df, stringr::str_count(article, paste(key.words, collapse="|")) >= 2)
        date                                              article
1 2015-05-08 Articel does contain two keywords revenue and margin
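
One caveat: counting matches of the combined pattern counts every occurrence, so an article that mentions "revenue" twice would also reach a count of 2. If the threshold is meant to apply to distinct keywords, a minimal base R sketch could look like the following. The names n.distinct, min.keywords and result are only illustrative, and fixed = TRUE assumes the keywords should be matched as literal text rather than as regular expressions.

```r
df <- data.frame(
  date = c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"),
  article = c("Articel does not contain a key word",
              "Articel does contain the key word revenue",
              "Articel does contain two keywords revenue and margin",
              "Articel does not contain the key word margin"),
  stringsAsFactors = FALSE
)
key.words <- c("revenue", "margin", "among others")

# For each keyword, test every article for a literal match,
# then add up the TRUE/FALSE vectors element-wise.
n.distinct <- Reduce(`+`, lapply(key.words, function(kw) {
  grepl(kw, df$article, fixed = TRUE)
}))

min.keywords <- 2  # hypothetical threshold n
result <- df[n.distinct >= min.keywords, ]
```

On the sample data this keeps only the 2015-05-08 article, the same row as above. The two approaches only differ once a keyword is repeated within one article.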
FlorianGD