I need to search through a text string for keywords and then assign a category in an R dataframe. This causes a problem when a string contains keywords from more than one category. I would like to easily extract the rows where more than one category is represented so that I can manually evaluate them and assign the correct category.
To do this, I have tried to add a count column to show how many categories are represented in each string.
Using a combination of the two solutions linked below, I have managed to get part of the way, but I am still not getting the correct output:
Partial animal string matching in R
Count occurrences of specific words from a dataframe row in R
I have created an example below. I would like the following rules to be applied:
if the string has cat or lion, wcount gets 1 - only 1 group represented (feline)
if the string has dog or wolf, wcount gets 1 - only 1 group represented (canine)
if the string has (cat or lion) AND (dog or wolf), wcount gets 2 - two groups represented (feline and canine)
I can then easily pull out rows where wcount > 1
id <- c(1:5)
text <- c('saw a cat',
'found a dog',
'saw a cat by a dog',
'There was a lion',
'Huge wolf'
)
dataset <- data.frame(id,text)
SearchGrp<-list(c("(cat|lion)", "feline"),
c("(dog|wolf)","canine"))
output_vector<- character (nrow(dataset))
for (i in seq_along(SearchGrp)) {
  output_vector[grepl(x = dataset$text, pattern = SearchGrp[[i]][1], ignore.case = TRUE)] <- SearchGrp[[i]][2]
}
dataset$type<-output_vector
keyword_temp <- unlist(lapply(SearchGrp, function(x) x[1]))
keyword<-paste(keyword_temp[1],"|",keyword_temp[2])
library(stringr)
getCount <- function(data, keyword)
{
  # use the data frame passed in, not the global `dataset`
  wcount <- str_count(data$text, keyword)
  return(data.frame(data, wcount))
}
getCount(dataset,keyword)
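To make the rules above concrete, here is a minimal sketch of the result I am after: one count per group matched, rather than one per keyword. The names `example`, `patterns` and `hits` are hypothetical and only for illustration; they are not part of my attempt above, and this is just one possible way to express the rules in base R.

```r
# Hypothetical names (`example`, `patterns`, `hits`) used for illustration only
example <- data.frame(
  id = 1:5,
  text = c("saw a cat", "found a dog", "saw a cat by a dog",
           "There was a lion", "Huge wolf"),
  stringsAsFactors = FALSE
)

# One regular expression per category, named by category
patterns <- c(feline = "cat|lion", canine = "dog|wolf")

# Logical matrix: one row per string, one column per category
hits <- sapply(patterns, function(p) grepl(p, example$text, ignore.case = TRUE))

# Number of categories represented in each string
example$wcount <- rowSums(hits)

# Rows that would need manual review
example[example$wcount > 1, ]
```

Note that this sketch matches substrings, so "cat" would also match a word like "category"; word boundaries (`\\b`) in the patterns would be needed to avoid that.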