13

Given a regular expression containing capture groups (parentheses) and a string, how can I obtain all the substrings matching the capture groups, i.e., the substrings usually referenced by "\1", "\2"?

Example: consider a regex capturing digits preceded by "xy":

s <- "xy1234wz98xy567"

r <- "xy(\\d+)"

Desired result:

[1] "1234" "567" 

First attempt: gregexpr:

regmatches(s,gregexpr(r,s))
#[[1]]
#[1] "xy1234" "xy567" 

Not what I want because it returns the substrings matching the entire pattern.

Second try: regexec:

regmatches(s,regexec("xy(\\d+)",s))
#[[1]]
#[1] "xy1234" "1234" 

Not what I want because it returns only the first occurrence of a match for the entire pattern and its capture group.

If there were a gregexec function, extending regexec as gregexpr extends regexpr, my problem would be solved.

So the question is: how to retrieve all substrings (or indices that can be passed to regmatches as in the examples above) matching capture groups in an arbitrary regular expression?

Note: the pattern for r given above is just a silly example, it must remain arbitrary.

Andy G
Ferdinand.kraft

4 Answers

12

For a base R solution, what about just using gsub() to finish processing the strings extracted by gregexpr() and regmatches()?

s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

gsub(r, "\\1", regmatches(s,gregexpr(r,s))[[1]])
# [1] "1234" "567" 
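The same idea can be wrapped in a small helper that works on a vector of strings and lets you pick the group. This is only a sketch: `extract_group` and its `group` argument are hypothetical names, not base R functions, and it assumes the pattern matches each extracted substring the same way when that substring is seen in isolation (which may not hold for lookarounds or anchors).

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# Hypothetical helper: return capture group `group` for every match in each string.
# Extra arguments (e.g. perl = TRUE) are passed to both gregexpr() and sub().
extract_group <- function(pattern, x, group = 1L, ...) {
  # One character vector of full matches per input string
  full <- regmatches(x, gregexpr(pattern, x, ...))
  # Replace each full match by the requested backreference, e.g. "\\1"
  lapply(full, function(m) sub(pattern, paste0("\\", group), m, ...))
}

res <- extract_group(r, c(s, "no match here"))
res[[1]]
# [1] "1234" "567"
res[[2]]
# character(0)
```

Strings without a match yield `character(0)`, so the result always has one list element per input string.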
Josh O'Brien
  • Nice trick, but it may fail if, for instance, the capture group is inside a lookaround (`perl=TRUE`). – Ferdinand.kraft Sep 06 '13 at 21:00
  • @Ferdinand.kraft -- Would you give an example? I'm thinking that lookarounds aren't captured, so I can't see what you mean. – Josh O'Brien Sep 06 '13 at 21:09
  • Consider this: `r <- "xy(?=(\\d+))"; gsub(r, "\\1", regmatches(s,gregexpr(r,s,perl=TRUE))[[1]], perl=TRUE); gsub(r, "\\1", s, perl=TRUE)`. Note that `gsub` (alone) can retrieve the digits in `\1`. – Ferdinand.kraft Sep 06 '13 at 21:30
  • @Ferdinand.kraft -- Interesting to see that they are indeed captured (even if they aren't 'eaten up' by the regex algorithm). – Josh O'Brien Sep 06 '13 at 22:00
11

Not sure about doing this in base, but here's a package for your needs:

library(stringr)

str_match_all(s, r)
#[[1]]
#     [,1]     [,2]  
#[1,] "xy1234" "1234"
#[2,] "xy567"  "567" 

Many stringr functions also have parallels in base R, so you can also achieve this without using stringr.

For instance, here's a simplified version of how the above works, using base R:

sapply(regmatches(s,gregexpr(r,s))[[1]], function(m) regmatches(m,regexec(r,m)))
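Since `regexec()` is vectorised over its text argument, the per-match loop can also be written without `sapply()`, and the pieces bound into a matrix shaped like the `str_match_all()` result. A rough sketch for a single input string (the variable names are just for illustration):

```r
s <- "xy1234wz98xy567"
r <- "xy(\\d+)"

# Step 1: all full matches of the pattern in s
full <- regmatches(s, gregexpr(r, s))[[1]]

# Step 2: re-match each full match to split it into whole match + groups
parts <- regmatches(full, regexec(r, full))

# Step 3: one row per match; column 1 is the whole match, column 2 the group
m <- do.call(rbind, parts)
m[, 2]
# [1] "1234" "567"
```

Note that `do.call(rbind, parts)` gives `NULL` when there are no matches at all, so that case would need separate handling.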
bschneidr
eddi
  • This is exactly what I need. I'll check its source. I believe there is (or should be) a solution in base R, given that this is a basic task. – Ferdinand.kraft Sep 04 '13 at 19:48
  • It just uses `lapply` and `regexec` (though cleverly)... Just type `str_match_all` and `str_match` to see this... – Arun Sep 04 '13 at 20:15
8

strapplyc in the gsubfn package does that:

> library(gsubfn)
>
> strapplyc(s, r)
[[1]]
[1] "1234" "567" 

Try ?strapplyc for additional info and examples.

Related Functions

1) A generalization of strapplyc is strapply in the same package. It takes a function which inputs the captured portions of each match and returns the output of the function. When the function is c it reduces to strapplyc. For example, suppose we wish to return results as numeric:

> strapply(s, r, as.numeric)
[[1]]
[1] 1234  567

2) gsubfn is another related function in the same package. It is like gsub except the replacement string can be a replacement function (or a replacement list or a replacement proto object). The replacement function inputs the captured portions and outputs the replacement. The replacement replaces the match in the input string. If a formula is used, as in this example, the right hand side of the formula is regarded as the function body. In this example we replace the match with XY{#} where # is twice the matched input number.

> gsubfn(r, ~ paste0("XY{", 2 * as.numeric(x), "}"), s)
[1] "XY{2468}wz98XY{1134}"

UPDATE: Added strapply and gsubfn examples.

G. Grothendieck
  • This is great. One question: does `strapplyc` call `regexec` in a loop, like `stringr::str_match_all`? – Ferdinand.kraft Sep 05 '13 at 13:03
  • `strapplyc` uses underlying code written in tcl (a string processing language) to enable it to handle very large strings and for speed (unless the `engine=` argument specifies a different engine). See `?strapplyc` for details. – G. Grothendieck Sep 05 '13 at 13:08
  • Thank you a lot for this answer. I accepted eddi's because he is correct and answered first. – Ferdinand.kraft Sep 11 '13 at 18:50
0

Since R 4.1.0, there is gregexec:

regmatches(s,gregexec(r,s))[[1]][2, ]
[1] "1234" "567"
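With more than one capture group, the matrix returned through `regmatches()` has one column per match and one row per group, with the whole match in row 1. A small sketch, using a made-up two-group pattern `r2` purely for illustration and guarded so it only runs where `gregexec` exists:

```r
s  <- "xy1234wz98xy567"
r2 <- "([a-z]+)(\\d+)"   # illustrative pattern: letters, then digits

if (getRversion() >= "4.1.0") {
  # Rows: whole match, group 1, group 2; columns: successive matches
  m <- regmatches(s, gregexec(r2, s))[[1]]

  letters_part <- m[2, ]   # "xy" "wz" "xy"
  digits_part  <- m[3, ]   # "1234" "98" "567"

  # Transpose to get one row per match, as str_match_all() does
  t(m)
}
```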
starja