I'm trying to determine which parts of a string match a specific named capture group, using stringi and R (and thus ICU regex). However, if the named capture group is the first child of an unnamed capture group, the name is lost in the output.
Here is a contrived example (the real pattern is much more complex):
library(stringi)
stri_locate_all_regex("ab", "((?<letterone>[a-z])(?<lettertwo>[a-z]))", capture_groups = TRUE)
#> [[1]]
#> start end
#> [1,] 1 2
#> attr(,"capture_groups")
#> attr(,"capture_groups")[[1]]
#> start end
#> [1,] 1 2
#>
#> attr(,"capture_groups")[[2]]
#> start end
#> [1,] 1 1
#>
#> attr(,"capture_groups")$lettertwo
#> start end
#> [1,] 2 2
Capture group 2 appears to correspond to the named capture group letterone (it matches only the first letter), but its name is missing from the output.
If the named group is not the first item in the enclosing capture group, the output is as expected, even if the first item is a no-op, e.g. a{0}.
stri_locate_all_regex("ab", "(a{0}(?<letterone>[a-z])(?<lettertwo>[a-z]))", capture_groups = TRUE)
#> [[1]]
#> start end
#> [1,] 1 2
#> attr(,"capture_groups")
#> attr(,"capture_groups")[[1]]
#> start end
#> [1,] 1 2
#>
#> attr(,"capture_groups")$letterone
#> start end
#> [1,] 1 1
#>
#> attr(,"capture_groups")$lettertwo
#> start end
#> [1,] 2 2
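To compare the two cases without reading the printed matrices, the names attached to the capture_groups attribute can be pulled out directly. This is only a minimal sketch: capture_group_names is a hypothetical helper written for this question, not a stringi function, and it inspects only the first input string.

```r
library(stringi)

# Hypothetical helper (not part of stringi): run the pattern against `str`
# and return the names attached to the "capture_groups" attribute of the
# first element of the stri_locate_all_regex() result.
capture_group_names <- function(pattern, str = "ab") {
  res <- stri_locate_all_regex(str, pattern, capture_groups = TRUE)
  names(attr(res[[1]], "capture_groups"))
}

# Named group is the first child of the unnamed outer group
capture_group_names("((?<letterone>[a-z])(?<lettertwo>[a-z]))")

# Same pattern with a no-op a{0} placed before the named group
capture_group_names("(a{0}(?<letterone>[a-z])(?<lettertwo>[a-z]))")
```

Judging from the printed output above, I would expect the first call to show an empty name where letterone should be, and the second call to list both letterone and lettertwo.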
Is there a way to extract named capture groups regardless of their position? And is this a known phenomenon I just don't know about, or a bug?