
I have two dataframes: msnbc contains a column of news transcripts called text and dictionary contains a column of words called search. I want to return a new dataframe that includes all rows of msnbc where the text field contains one or more words from the search column. Toy data:

msnbc <- data.frame(id=c(1,2,3), text=c("hello world", "goodbye world","hello friends"))
dictionary <- data.frame(search=c("hello","lorem","ipsum","dolor"))

The new dataset should include the first and third rows of msnbc because their text contains one of the words from dictionary$search.

My first thought was to use str_detect, but I couldn't find an option for passing a vector of strings as the pattern. My other idea was to use filter somehow, but I'm not sure how to implement it:

new_msnbc <- msnbc %>%
    filter(dictionary$search %in% text)

But this doesn't work as intended. What is the best way to do this? Bonus points for a tidyverse solution.

James Martherus
  • try `grepl` for things like this. `%in%` is not the correct operator. – cory Sep 26 '19 at 19:40
  • grepl doesn't take a vector of strings, only a regex pattern or a single string. I thought there might be a solution that allows a character vector as the pattern for matching. – James Martherus Sep 26 '19 at 19:43
  • Yes, so concatenate them together. https://www.regular-expressions.info/alternation.html – cory Sep 26 '19 at 19:44

1 Answer


It appears you can do this with filter and grepl:

library(dplyr)
result <- msnbc %>%
    filter(grepl(paste(dictionary$search, collapse="|"), text))
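
One caveat with a plain alternation pattern is that it matches substrings, so "hello" would also match inside a longer word. If whole-word matches are what you want, a possible variant is sketched below using stringr::str_detect with \\b word boundaries. This is only a sketch: it assumes the dictionary terms contain no regex metacharacters (they would need escaping otherwise), and the variable name `pattern` and the `stringsAsFactors = FALSE` argument are my additions, not part of the original question.

```r
library(dplyr)
library(stringr)

# Toy data from the question; stringsAsFactors = FALSE is an added
# assumption so that text is a character column on older R versions
msnbc <- data.frame(
  id = c(1, 2, 3),
  text = c("hello world", "goodbye world", "hello friends"),
  stringsAsFactors = FALSE
)
dictionary <- data.frame(
  search = c("hello", "lorem", "ipsum", "dolor"),
  stringsAsFactors = FALSE
)

# Collapse the search terms into one alternation and wrap it in \\b
# anchors so that only whole words match
pattern <- paste0("\\b(", paste(dictionary$search, collapse = "|"), ")\\b")

# Keep the rows whose text contains at least one dictionary word
result <- msnbc %>%
  filter(str_detect(text, pattern))
```

On the toy data this keeps the rows with id 1 and 3, the same as the grepl version.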
James Martherus