I am really confused about how to use backreferences.

strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

gsub("(ab) 12", "\\1 34", strings)
[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 34"

gsub("(ab)12", "\\2 34", strings)
[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 12"

I know \1 refers to the first subpattern (reading from the left), \2 refers to the second subpattern, and so on. But I don't know what "subpattern" means here. Why do \1 and \2 give different output?

gsub("(ab)", "\\1 34", strings)
[1] "^ab 34"   "ab 34"    "ab 34c"   "ab 34d"   "ab 34e"   "ab 34 12"

Also, why do I get this result when I remove the 12 after (ab)?

gsub("ab", "\\1 34", strings)
[1] "^ 34"   " 34"    " 34c"   " 34d"   " 34e"   " 34 12"

Furthermore, what if ab has no parentheses? What does that indicate?

I am really mixed up about backreferences and hope someone can explain the logic clearly.

zx8754
Bratt Swan
  • It's not a "subpattern", but a *capture group*. If you google it, you'll find a lot of resources. Any pattern inside parentheses `()` is a capture group. Anyway, I don't get the same results in your first example. The last element is `ab 34` and not `ab 12`. – nicola Jul 31 '16 at 07:20
  • Yes, you are right, I pasted an incorrect output. – Bratt Swan Jul 31 '16 at 16:00

1 Answer

In the first and second cases there is a single capture group, i.e. the part of the pattern enclosed in (...). In the first case the replacement uses the backreference correctly: \\1 refers to the first (and only) capture group. In the second case the replacement uses \\2, but the pattern has no second capture group, so there is nothing for \\2 to refer to.

To illustrate:

gsub("(ab)(d)", "\\1 34", strings)
#[1] "^ab"   "ab"    "abc"   "ab 34" "abe"   "ab 12"

Here we use two capture groups, (ab) and (d). The replacement is the first backreference (\\1) followed by a space and 34. In 'strings' the pattern matches only the 4th element, "abd". The whole match is replaced by "ab" (the text captured by the first group) followed by a space and 34, giving "ab 34".

Suppose we use the second backreference instead:

gsub("(ab)(d)", "\\2 34", strings)
#[1] "^ab"   "ab"    "abc"   "d 34"  "abe"   "ab 12"

Now the text captured by the first group is dropped, and we get "d" (the second group) followed by a space and 34.
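For contrast, here is a minimal sketch of what happens when the replacement refers to a group the pattern does not define. It reuses the `strings` vector from the question; the names `with_group` and `without_group` are only illustrative. The behavior shown (an undefined backreference contributing an empty string) is what the question's own output reports for default `gsub`, and may differ with other regex engines.

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")

# One capture group, and the replacement refers to it:
# the matched "ab" is put back and " 34" is appended after it.
with_group <- gsub("(ab)", "\\1 34", strings)
print(with_group)
# "^ab 34" "ab 34" "ab 34c" "ab 34d" "ab 34e" "ab 34 12"

# No capture group in the pattern: \\1 has nothing to refer to,
# so it contributes an empty string and the matched "ab" is lost.
without_group <- gsub("ab", "\\1 34", strings)
print(without_group)
# "^ 34" " 34" " 34c" " 34d" " 34e" " 34 12"
```

In other words, parentheses do not change what the pattern matches here; they only decide whether the matched text is available to the replacement as \\1.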

Now suppose we use a more general pattern instead of specific characters:

gsub("([a-z]+)\\s*(\\d+)", "\\1 34", strings)
#[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "ab 34"
gsub("([a-z]+)\\s*(\\d+)", "\\2 34", strings)
#[1] "^ab"   "ab"    "abc"   "abd"   "abe"   "12 34"

Note how the last element changes when we switch from the first backreference to the second. The pattern is one or more lowercase letters in the first capture group, ([a-z]+), followed by zero or more whitespace characters (\\s*), followed by one or more digits in the second capture group, (\\d+). It matches only the last element of 'strings'. In the replacement, we use the first and second backreferences as shown above.
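Both backreferences can also appear in the same replacement, and literal text can go before or after them. The following is a small sketch using the same pattern; the variable names `pattern`, `swapped` and `literal_first` are only illustrative.

```r
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")
pattern <- "([a-z]+)\\s*(\\d+)"

# Use both groups, in reverse order: digits first, then letters.
swapped <- gsub(pattern, "\\2 \\1", strings)
print(swapped)
# "^ab" "ab" "abc" "abd" "abe" "12 ab"

# Literal "34" first, then the text captured by the second group.
literal_first <- gsub(pattern, "34 \\2", strings)
print(literal_first)
# "^ab" "ab" "abc" "abd" "abe" "34 12"
```

The replacement string is built left to right: each \\N is substituted with whatever group N captured, and everything else is copied literally.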

akrun
  • This makes more sense. But I am still confused about `gsub("([a-z]+)\\s*(\\d+)", "\\2 34", strings)`. As you said, it matches "ab 12", and you used \\2 to capture the second group. What it implies to me is that when you capture a group, that group will be fixed, so "ab 12" will be switched to "34 12" but not "12 34" – Bratt Swan Jul 31 '16 at 15:56
  • @BrattSwan In the example concerned, I am replacing the match with the second capture group, i.e. `(\\d+)`, so it returns `12`. As we are also adding a space followed by 34, the result is `"12 34"`. If you want "34 12" instead, the replacement would be `"34 \\2"`. – akrun Jul 31 '16 at 16:16