I am trying to partially match the contents of a column in a data set against a set of regular expressions. For each matching row, I would then like the particular regular expression that matched to be returned in a new column. My actual data set is large (1.3 million rows) and I have 300 regular expressions, so it's important to find an automated way of doing this, so that adding new regular expressions won't require changing the code.
To demonstrate:
# The five words are recycled to fill all ten rows
try.dat <- data.frame(num = 1:10,
                      words = c("hello", "goodbye", "tidings", "partly", "totally"))
try.dat
In this case, if one of the regular expressions were 'ly', I would like a new column containing 'ly' in the matching rows (partly, totally) and some 'non-matched' term in the other rows. I have managed to subset the data successfully using grepl (subset not based on exact match), which works perfectly, but it's this next step I'm really struggling with!
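To make the target concrete, here is a minimal sketch of the result I'm after for that single 'ly' pattern. The column name `match` and the placeholder text "non-matched" are only assumptions for illustration, and `stringsAsFactors = FALSE` is set just so the words stay plain character strings:

```r
# Rebuild the example data (the five words are recycled to fill ten rows)
try.dat <- data.frame(num = 1:10,
                      words = c("hello", "goodbye", "tidings", "partly", "totally"),
                      stringsAsFactors = FALSE)

# Label rows whose word contains "ly"; every other row gets a placeholder
try.dat$match <- ifelse(grepl("ly", try.dat$words), "ly", "non-matched")
try.dat
```

This handles one pattern only; what I need is the same kind of column built from the whole list of regular expressions.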
I have made some progress with this, mostly based on this code suggestion (partial string matching R), which I have adapted as follows:
pattern <- c("ll|ood")
matching <- c("ood", "ll")
regexes <- data.frame(pattern, matching)
output_vector <- character(nrow(try.dat))
for (i in seq_along(regexes)) {
  output_vector[grepl(x = try.dat$words, pattern = regexes[[i]][1])] <- regexes[[i]][2]
}
try.dat$match <- output_vector
try.dat
As you can see, this returns a '1' next to the matched rows. I'm getting there, but I've run out of ideas! Could anyone give me any pointers?
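One direction that might be worth trying is to keep the regular expressions in a plain character vector and loop over its elements, writing the pattern text itself into the new column. This is only a sketch under some assumptions: the names `patterns`, `no_match` and `match_col` are made up here, the placeholder text is arbitrary, and when a word matches several patterns the first one in the vector is kept, which may or may not be the behaviour wanted:

```r
# Rebuild the example data with words kept as character strings
try.dat <- data.frame(num = 1:10,
                      words = c("hello", "goodbye", "tidings", "partly", "totally"),
                      stringsAsFactors = FALSE)

# Hypothetical pattern list; new regular expressions would simply be appended
patterns <- c("ood", "ll", "ly")
no_match <- "non-matched"

# Start with every row unlabelled
match_col <- rep(no_match, nrow(try.dat))
for (p in patterns) {
  # Only fill rows that are still unlabelled, so the first matching pattern wins
  hit <- match_col == no_match & grepl(p, try.dat$words)
  match_col[hit] <- p
}
try.dat$match <- match_col
try.dat
```

With this ordering 'totally' is labelled 'll' rather than 'ly', because 'll' comes first in `patterns`. I have not tried this on the full 1.3 million rows, so I can't say how it performs with 300 patterns.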
Thanks!