2

I am trying to partially match the contents of a column in a data set with a string of regular expressions. I would then like matching rows returning the particular matching regular expression in a new column. My actual data set is large (1.3 million rows) with 300 regular expressions so it's important to find an automated way of doing this so adding new regular expressions won't require code adaptation.

To demonstrate:

try.dat<-data.frame(c(1:10),c("hello","goodbye","tidings","partly","totally"))
names(try.dat)[1]<-"num"
names(try.dat)[2]<-"words"
try.dat

In this case, if one regular expression was 'ly' I would like to have a column with 'ly' in matching rows (partly, totally), and some 'non-matched' term in other rows. I have managed to successfully subset the data using grepl (subset not based on exact match) which works perfectly, but it's this next step I'm really struggling with!

I have had some progress at trying this, mostly based on this code suggestion (partial string matching R) which I have adapted as such:

pattern<-c("ll|ood")
matching<-c("ood","ll")
regexes<-data.frame(pattern,matching)
output_vector<-character(nrow(try.dat))
for(i in seq_along(regexes)){
output_vector[grepl(x=try.dat$words,pattern=regexes[[i]][1])] <- regexes    [[i]][2]    
}
try.dat$match<- output_vector
try.dat

As you can see this returns a '1' next to matched rows - getting there but I've run out of ideas! I was wondering if anyone could give any pointers?

Thanks!

Community
  • 1
  • 1
D.Singleton
  • 179
  • 2
  • 12
  • Not sure what you want as the output. A new `match` column with the index of regex matches? Can you post the desired output column? – Pierre L Oct 06 '15 at 22:25
  • Yes you're right @PierreLafortune so for the word 'partly' and regular expression 'ly', I would like 'ly' in the match column. To broaden out, my actual dataset is based on item codes for pharmaceutical prescriptions which vary in spellings/trade names etc. - my regular expressions will provide the link to a standardised phrase which will then allow me to fit the item code with the actual product dispensed - hard to explain when the dataset is too massive to post! – D.Singleton Oct 06 '15 at 22:27
  • If you want to know *which* pattern matched, you'll need to do two `grep`s. You should also have a contingency plan in case *both* match. Your question will be clearer if you unify your example. You initially talk about matching `"ly"`, but when you share code it's `"ll|ood"`. – Gregor Thomas Oct 06 '15 at 22:32
  • Apologies @Gregor so to clarify: word = hello, reg ex = ll, match = ll – D.Singleton Oct 06 '15 at 22:34
  • Do you mind giving a quick example of the double grep, please? I've had a bit of a go with stringr, but I got to a point where it didn't seem to work so pursued other avenues – D.Singleton Oct 06 '15 at 22:40
  • To be clear, you want to match an entire vector of regexes against a string column. **Should your output be an array of logicals** as returned by grepl (1:match, 0:no match), or a **column of string** (/vector of strings) of the regex that hit (how to handle multiple matches)? – smci Oct 06 '15 at 23:01

3 Answers3

2

I think this will do?

library(stringr)
try.dat$match = str_extract(try.dat$words, "ll|ood")
try.dat
#    num   words match
# 1    1   hello    ll
# 2    2 goodbye   ood
# 3    3 tidings  <NA>
# 4    4  partly  <NA>
# 5    5 totally    ll
# 6    6   hello    ll
# 7    7 goodbye   ood
# 8    8 tidings  <NA>
# 9    9  partly  <NA>
# 10  10 totally    ll

The default behavior is to extract the first match. You can use str_extract_all if you want to get all matches, but you'll need a non-data.frame set-up in that case that can handle different numbers of matches.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • That's fantastic, thank you! Works for me on this data set - out of interest for you, as my actual data set has many reg ex's I had a quick check to make sure str_extract understands summarised strings i.e. pattern<-"ll | ood", which it does. As far as first match goes, I've worked on my reg ex's so there should only be one match - so hopefully won't have to worry about str_extract_all. I'll have a go with this on the full dataset in work tomorrow and let you know how I get on. – D.Singleton Oct 06 '15 at 22:49
  • Glad to hear. As a note for the future, your question would have been *much* clearer if you had included a few lines of the end result as example output. When I posted this answer I had read your question three times and I was only about 70% sure this was what you wanted. – Gregor Thomas Oct 06 '15 at 23:11
  • Thanks for the pointers, new to posting on stack overflow and about data handling topics so good to know what will make things clearer. As a further note, this solution works perfectly for my actual data set… had to adapt a couple of reg ex's to be more specific due the extracting first match (as you noted), but this was pretty easy and preferable to having multiple matches which could really complicate further analysis. Thanks again! – D.Singleton Oct 07 '15 at 20:43
1

A base R option. Just because.

patt <- c("ll", "ood")
for (i in 1: length(patt)) {
  try.dat[grep(patt[i], try.dat$words), "match"] <- patt[i]
}
try.dat
#    num  words match
#1    1   hello    ll
#2    2 goodbye   ood
#3    3 tidings  <NA>
#4    4  partly  <NA>
#5    5 totally    ll
#6    6   hello    ll
#7    7 goodbye   ood
#8    8 tidings  <NA>
#9    9  partly  <NA>
#10  10 totally    ll
GGeco
  • 28
  • 7
  • Thanks - interesting and always good to see a base R way of doing things! out of interest did a run time on each - see answer below – D.Singleton Dec 20 '15 at 16:27
1

Run time comparisons of both based on a dataset expanded to 10M rows (Macbook Pro OS X):

try.dat<-data.frame(c(1:10000000),c("hello","goodbye","tidings","partly","totally"))
system.time(try.dat[str_extract(try.dat$words,"ll|ood"),"match"])

user system elapsed

5.167 0.208 5.348

system.time(for (i in 1: length(patt)) {try.dat[grep(patt[i], try.dat$words), "match"] <- patt[i]})

user system elapsed

0.311 0.041 0.377

Indications so far are that the base R version offers a significant increase in efficiency. Will try this on my actual dataset (400< reg ex's over 2M rows and see if it continues to beat the package version. Cheers!

D.Singleton
  • 179
  • 2
  • 12