1

I'm trying to determine which parts of a string match a specific named capture group, using stringi and R (and thus ICU regex). However, if the named capture group is the first child of an unnamed capture group, the name is lost in the output.

Here is a contrived example; the real pattern is much more complex:

library(stringi)

stri_locate_all_regex("ab", "((?<letterone>[a-z])(?<lettertwo>[a-z]))", capture_groups = TRUE)
#> [[1]]
#>      start end
#> [1,]     1   2
#> attr(,"capture_groups")
#> attr(,"capture_groups")[[1]]
#>      start end
#> [1,]     1   2
#> 
#> attr(,"capture_groups")[[2]]
#>      start end
#> [1,]     1   1
#> 
#> attr(,"capture_groups")$lettertwo
#>      start end
#> [1,]     2   2

Capture group 2 appears to correspond to the named capture group letterone (it matches the first letter only), but its name is lost in the output.

If the named group is not the first item in its parent capture group, the output is as expected, even if the first item is a no-op such as a{0}.

stri_locate_all_regex("ab", "(a{0}(?<letterone>[a-z])(?<lettertwo>[a-z]))", capture_groups = TRUE)
#> [[1]]
#>      start end
#> [1,]     1   2
#> attr(,"capture_groups")
#> attr(,"capture_groups")[[1]]
#>      start end
#> [1,]     1   2
#> 
#> attr(,"capture_groups")$letterone
#>      start end
#> [1,]     1   1
#> 
#> attr(,"capture_groups")$lettertwo
#>      start end
#> [1,]     2   2

Is there a way to extract named capture groups regardless of their position? And is this a known phenomenon I just don't know about, or a bug?

Erik A
  • Are you saying `((?<letterone>[a-z])(?<lettertwo>[a-z]))` does not display 3 capture groups as expected? If not, it would appear to be a bug. – sln Jun 23 '23 at 22:38
  • @sln It does display 3 capture groups. The problem is, only one of them is named, while I expected two to be named – Erik A Jun 24 '23 at 05:41

2 Answers

1

Named capture group support is tracked in gagolews/stringi issue 153 and should work as expected, assuming stringi 1.7 or more recent (Q3 2021).
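
To check which stringi and ICU versions are actually in use before relying on that, something like the following should do. It is only a minimal sketch: packageVersion() is base R, stri_info() is stringi's own diagnostic call, and the 1.7 threshold is simply the one mentioned above.

```r
library(stringi)

# Version of the installed stringi package
stringi_version <- packageVersion("stringi")
print(stringi_version)

# Version of the ICU library that stringi is using
icu_version <- stri_info()$ICU.version
print(icu_version)

# TRUE if the installed stringi is at least 1.7
print(stringi_version >= "1.7.0")
```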

As a possible workaround, I would try and avoid nesting named capture groups within unnamed groups if you are observing this kind of behavior.
For example:

library(stringi)

# Avoid nesting named capture groups within unnamed groups
result <- stri_locate_all_regex("ab", "(?<letterone>[a-z])(?<lettertwo>[a-z])", capture_groups = TRUE)
print(result)

This will directly create two named capture groups without nesting them within an unnamed group.
However, if nesting is required for your actual, more complex use case, this may not be feasible.


From the comments: "Unfortunately, not using nested capture groups is not feasible for me.
The actual regexes I'm using are built up out of parts, and some consist of over 250 capture groups in a single regex, often about 7-10 levels deep."

Since stringi (which relies on ICU for its regex engine) does not report named capture groups reliably in this nested case, you might consider using unnamed capture groups and post-processing the results to map group numbers to names.

This involves keeping a separate mapping of group numbers to names and applying it after the match is made.

library(stringi)

# Input string to search
input <- "ab"

# Your complex regex pattern with unnamed capture groups
pattern <- "(([a-z])([a-z]))"

# Mapping of capture group numbers to names (in order of opening parenthesis)
capture_group_names <- c("whole_match", "letterone", "lettertwo")

# Perform the regex match
result <- stri_locate_all_regex(input, pattern, capture_groups = TRUE)

# Extract capture groups
capture_groups <- attr(result[[1]], "capture_groups")

# Create a named list to store the results with names
named_results <- list()

# Map the unnamed capture groups to names
for (i in seq_along(capture_group_names)) {
  start_end <- capture_groups[[i]]
  start <- start_end[1, "start"]
  matched <- if (!is.na(start) && start > 0) {
    # Extract the substring if the group took part in the match
    stri_sub(input, from = start, to = start_end[1, "end"])
  } else {
    # A missing (NA) or non-positive start means no match, set to NA
    NA_character_
  }
  named_results[[capture_group_names[i]]] <- matched
}

# Display the results with names
print(named_results)

This performs a regex match with unnamed capture groups, then extracts the capture groups and maps them to names using the separate capture_group_names vector.
The final result is stored in the named list named_results, where each element corresponds to one capture group.

In your actual use case, you would have a much more complex pattern and a longer capture_group_names vector to match the structure of your regex.
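
With hundreds of groups, keeping that vector in sync by hand may be impractical. As a rough sketch only, the labels could instead be read off the named pattern itself, in order of opening parenthesis. The helper capture_group_labels below is hypothetical (it is not part of stringi), and it assumes the pattern contains no parentheses inside character classes such as [(] and does not use free-spacing mode; backslash-escaped characters are skipped.

```r
library(stringi)

# Hypothetical helper: list the capture groups of `pattern` in opening order,
# giving the name for named groups and NA for unnamed ones.
capture_group_labels <- function(pattern) {
  tokens <- stri_match_all_regex(
    pattern,
    # an escaped character, OR "(?<name>", OR a "(" not followed by "?"
    "\\\\.|\\((?:\\?<([A-Za-z][A-Za-z0-9]*)>|(?!\\?))"
  )[[1]]
  # Keep only the tokens that open a capture group
  is_group <- !is.na(tokens[, 1]) & stri_startswith_fixed(tokens[, 1], "(")
  tokens[is_group, 2]
}

pattern <- "((?<letterone>[a-z])(?<lettertwo>[a-z]))"
labels <- capture_group_labels(pattern)
print(labels)

# Apply the labels to the located groups, whatever names stringi reported
result <- stri_locate_all_regex("ab", pattern, capture_groups = TRUE)
groups <- attr(result[[1]], "capture_groups")
if (length(groups) == length(labels)) {
  names(groups) <- ifelse(is.na(labels), "", labels)
}
print(groups)
```

The length check is there because the labels are only trustworthy when the scan finds exactly as many groups as the regex engine does.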

VonC
  • Unfortunately, not using nested capture groups is not feasible for me. The actual regexes I'm using are built up out of parts, and some consist of over 250 capture groups in a single regex, often about 7-10 levels deep – Erik A Jun 23 '23 at 19:52
  • @ErikA OK. I have included your comment for more visibility, and suggested another alternative approach. – VonC Jun 23 '23 at 20:00
1

I think Boost and Perl had this parsing problem at one point too.
Seems regex designers can't parse worth a damn.

There is only one problem here.

It's that the parent's IsNamed capture flag is set to FALSE. When the first construct in the descent is parsed, it retains the parent's IsNamed attribute.
So if the first construct is a named capture group, its IsNamed attribute becomes FALSE as well.
This inherited action occurs only once, and it affects the first construct after every unnamed capture group starts its parse.
It basically overwrites the first child construct's IsNamed flag.
After that first (erroneous) overwrite, it does not attempt it again.

This is an atomic action and occurs wherever this parse series is encountered.

Graphically, in this sequence it's (here(?<P..)

This does not affect the regex in any other way, unless the named group can be referenced in recursion, e.g. (?&captgrp), if R supports that.

To work around this, do a replace that inserts a neutral construct, (?:), between the two opening parentheses:

stri_replace_all_regex("((?<name", "(?<=\\()(?=\\(\\?<\\w)", "(?:)")

https://www.mycompiler.io/view/58uzBJryBUG

So to test it :

stri_locate_all_regex("abc", "((?:)(?<letterone>[a-z])(?<lettertwo>[a-z])(?<letterthree>[a-z]))", capture_groups = TRUE)

https://www.mycompiler.io/view/0zBk9AtVLo7

After this mild fix, there should not be any parsing problem: the named capture group is now the second child, so its IsNamed flag will not be overwritten.
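
For patterns assembled from parts, the same replace could be wrapped in a small function and applied once to the finished pattern. This is only a sketch: insert_neutral_group is a hypothetical name, not a stringi function, and whether the names then show up depends on the stringi version in use, so no expected names are shown.

```r
library(stringi)

# Hypothetical helper (not part of stringi): put an empty non-capturing group
# between "(" and an immediately following "(?<name", as in the replace above.
insert_neutral_group <- function(pattern) {
  stri_replace_all_regex(pattern, "(?<=\\()(?=\\(\\?<\\w)", "(?:)")
}

pattern <- "((?<letterone>[a-z])(?<lettertwo>[a-z]))"
patched <- insert_neutral_group(pattern)
print(patched)

# Locate with the patched pattern and inspect the reported group names
result <- stri_locate_all_regex("ab", patched, capture_groups = TRUE)
print(names(attr(result[[1]], "capture_groups")))
```

Because (?:) is non-capturing and matches the empty string, group numbering and the matched text should be unchanged. The \w in the lookahead keeps lookbehinds such as (?<= and (?<! from being treated as named groups.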

sln