
I have a string variable in a large data set that I want to cleanse based on a fixed list of strings, e.g. pattern <- c("dog","cat"), although my real list will be about 400 elements long.

vector_to_clean == a

black Dog
white dOG
doggie
black CAT
thatdamcat

Then I want to apply a function to yield

new

dog
dog
dog
cat
cat

I have tried str_extract, grep, grepl, etc., but I can only match one pattern string at a time. I think what I want is to use lapply with one of these text-cleansing functions. Unfortunately, I'm stuck. Below is my latest attempt. Thank you for your help!

new <- vector()

lapply(pattern, function(x){
  where<- grep(x,a,value = FALSE, ignore.case = TRUE)
  new[where]<-x
  })
Kara_F

2 Answers


We paste the 'pattern' vector together with '|' to build a single regular expression, then use it to extract the matching word from each element of 'vec1' after converting it to lower case (tolower(vec1)).

library(stringr)
str_extract(tolower(vec1), paste(pattern, collapse='|'))
#[1] "dog" "dog" "dog" "cat" "cat"

data

pattern <- c("dog","cat") 
vec1 <- c('black Dog', 'white dOG', 'doggie','black CAT', 'thatdamcat')
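
The same approach can be sketched for input that contains values matching none of the patterns. This is a minimal illustration, not part of the original answer: the 'bird' element and the name 'cleaned' are assumptions added here. str_extract returns NA for any element with no match, so the result keeps the same length as the input.

```r
library(stringr)

pattern <- c("dog", "cat")
# 'bird' is a hypothetical extra element that matches no pattern
vec1 <- c('black Dog', 'white dOG', 'doggie', 'black CAT', 'thatdamcat', 'bird')

# One alternation regex built from the whole pattern vector
rx <- paste(pattern, collapse = '|')

# Lowercase the input first; unmatched elements come back as NA
cleaned <- str_extract(tolower(vec1), rx)
cleaned
#[1] "dog" "dog" "dog" "cat" "cat" NA
```

Because the whole pattern vector collapses into one regex, the call stays the same for a 400-element list. It assumes the patterns contain no regex metacharacters; any that do would need escaping first.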
akrun

Another way using base R is:

#data
vec <- c('black Dog', 'white dOG', 'doggie','black CAT','thatdamcat')

#regexpr finds the locations of cat and dog ignoring the cases
a <- regexpr( 'dog|cat', vec, ignore.case=TRUE )

#regmatches returns the above locations from vec (here we use tolower in order 
#to convert to lowercase)
regmatches(tolower(vec), a)
#[1] "dog" "dog" "dog" "cat" "cat"
LyzandeR
  • I tried this one out too and it works, but it isn't quite what I want because I also have 'bird' in my dataset and I want an NA placeholder for that. My explanation mistake. Thank you! – Kara_F Oct 23 '15 at 23:21
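
For the NA placeholder raised in the comment, one possible base R sketch is shown here. It is an illustration under assumptions rather than the answerer's own code: the 'bird' element, the 'pattern' vector, and the names 'm' and 'new' are chosen for the example. regmatches drops elements that did not match, so the sketch starts from a vector of NA and fills in only the matched positions.

```r
pattern <- c("dog", "cat")
# 'bird' stands in for an element that matches no pattern
vec <- c('black Dog', 'white dOG', 'doggie', 'black CAT', 'thatdamcat', 'bird')

# regexpr returns -1 for elements with no match
m <- regexpr(paste(pattern, collapse = '|'), vec, ignore.case = TRUE)

# Start with all NA, then fill only the positions that matched
new <- rep(NA_character_, length(vec))
new[m != -1] <- tolower(regmatches(vec, m))
new
#[1] "dog" "dog" "dog" "cat" "cat" NA
```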