
I need to search through a text string for keywords and then assign a category in an R dataframe. This causes a problem when a string contains keywords from more than one category. I would like to easily extract the rows where more than one category is represented so that I can evaluate them manually and assign the correct category.

To do this, I have tried to add a count column to show how many categories are represented in each string.

Using a combination of the two solutions linked below, I have managed to get part of the way, but I am still not getting the correct output:

Partial animal string matching in R

Count occurrences of specific words from a dataframe row in R

I have created an example below. I would like the following rules to be applied:

if string has cat or lion wcount gets 1 - only 1 group represented (feline)

if string has dog or wolf wcount gets 1 - only 1 group represented (canine)

if string has (cat or lion) AND (dog or wolf) wcount gets 2 - two groups represented (feline and canine)

I can then easily pull out rows where wcount > 1

id <- c(1:5)
text <- c('saw a cat',
      'found a dog',
      'saw a cat by a dog',
      'There was a lion',
      'Huge wolf'
      )
dataset <- data.frame(id,text)

SearchGrp<-list(c("(cat|lion)", "feline"),
            c("(dog|wolf)","canine"))

output_vector <- character(nrow(dataset))

for (i in seq_along(SearchGrp)) {
  output_vector[grepl(x = dataset$text, pattern = SearchGrp[[i]][1], ignore.case = TRUE)] <- SearchGrp[[i]][2]
}

dataset$type <- output_vector


keyword_temp <- unlist(lapply(SearchGrp, function(x) x[1]))
keyword <- paste(keyword_temp[1], "|", keyword_temp[2])

library(stringr)
getCount <- function(data, keyword) {
  wcount <- str_count(dataset$text, keyword)
  return(data.frame(data, wcount))
}

getCount(dataset, keyword)
Cyrus
SR_111

1 Answer


Here is a base R method to get the count across types.

dataset$wcnt <- rowSums(sapply(c("dog|wolf", "cat|lion"),
                               function(x) grepl(x, dataset$text)))

Here, sapply runs through the regular expression for each type and feeds it to grepl. This returns a matrix whose columns are logical vectors indicating whether a particular type (e.g., "dog|wolf") was found in each string. rowSums then sums the logicals along each row to give the number of types represented.

This returns

dataset
  id               text wcnt
1  1          saw a cat    1
2  2        found a dog    1
3  3 saw a cat by a dog    2
4  4   There was a lion    1
5  5          Huge wolf    1
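
With the count in place, the rows that need manual review can be pulled out by filtering on wcnt > 1. The following is a minimal sketch that rebuilds the question's example data so it runs on its own. The ignore.case = TRUE argument is an assumption added to mirror the question's own grepl call, and the names patterns, hits and to_review are purely illustrative.

```r
# Rebuild the example data from the question
dataset <- data.frame(
  id = 1:5,
  text = c("saw a cat", "found a dog", "saw a cat by a dog",
           "There was a lion", "Huge wolf"),
  stringsAsFactors = FALSE
)

# One regular expression per category
patterns <- c("dog|wolf", "cat|lion")

# Logical matrix: one row per text, one column per category
# (ignore.case = TRUE is an assumption, matching the question's grepl call)
hits <- sapply(patterns, function(p) grepl(p, dataset$text, ignore.case = TRUE))

# Number of categories represented in each row
dataset$wcnt <- rowSums(hits)

# Rows that mention more than one category and need manual review
to_review <- dataset[dataset$wcnt > 1, ]
print(to_review)
```

Note that these patterns match substrings, so "cat" would also match inside a longer word; wrapping the alternatives in word boundaries, e.g. "\\b(cat|lion)\\b", is one way to tighten this if it matters for your data.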

If you want the intermediate step, with the logical vectors stored as columns in your data.frame, set the patterns up in a named vector and then cbind the result to the original data.

# construct named vector
myTypes <- c("canine"="dog|wolf", "feline"="cat|lion")
# cbind sapply results of logicals to original data.frame
dataset <- cbind(dataset, sapply(myTypes, function(x) grepl(x, dataset$text)))

This returns

dataset
  id               text canine feline
1  1          saw a cat  FALSE   TRUE
2  2        found a dog   TRUE  FALSE
3  3 saw a cat by a dog   TRUE   TRUE
4  4   There was a lion  FALSE   TRUE
5  5          Huge wolf   TRUE  FALSE
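
The logical columns can also be collapsed into a single label per row, so that ambiguous rows are visible at a glance. This is only a sketch of one possible approach, not part of the original answer: the column name type, the helper name flags and the "+" separator are all assumptions chosen for illustration.

```r
# Rebuild the example data from the question
dataset <- data.frame(
  id = 1:5,
  text = c("saw a cat", "found a dog", "saw a cat by a dog",
           "There was a lion", "Huge wolf"),
  stringsAsFactors = FALSE
)

# Named vector: names are the category labels, values are the patterns
myTypes <- c("canine" = "dog|wolf", "feline" = "cat|lion")

# Logical matrix with one column per category, named after myTypes
flags <- sapply(myTypes, function(x) grepl(x, dataset$text))

# For each row, paste together the names of the categories that matched;
# rows matching several categories get a combined label such as "canine+feline",
# and rows matching none get an empty string
dataset$type <- apply(flags, 1, function(r) paste(colnames(flags)[r], collapse = "+"))

# Count of categories per row, as in the first approach
dataset$wcnt <- rowSums(flags)

print(dataset)
```

Rows with a "+" in type are the same rows that have wcnt > 1, so either column can be used to select the cases for manual review.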
lmo
  • Thanks, that works! I can just use the first part of my code to categorise and then pass SearchGrp to rowSums – SR_111 Jul 14 '17 at 08:57