PHP regex and adjacent capturing groups

Question

I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.

I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:

HelloWorldThisIsATest => hello-world-this-is-a-test

My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:

mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));

The result:

hello-world-this-is-atest

This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.

What am I doing wrong?

What about `"HelloWorldHTMLTest"`? Should that become `"hello-world-html-test"` or `"hello-world-h-t-m-l-test"`? — Ja͢ck, Jun 23 '14 at 06:58
@Jack Interesting use case I haven't thought of… I'd say the first one. — rink.attendant.6, Jun 23 '14 at 07:03

zx81 · Accepted Answer · 2014-06-23T07:07:59.260

6

The Reason your Regex will Not Work: Overlapping Matches

Your regex matches sA in IsATest, allowing you to insert a - between the s and the A.
In order to insert a - between the A and the T, the regex would have to match AT.
This is impossible because the A has already been consumed as part of sA. After each match the engine resumes from the end of that match, so matches can never overlap.
Is all hope lost? No! This is a perfect situation for lookarounds.
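To see the non-overlapping behaviour for yourself, here is a small sketch that lists every match the original pattern finds. The variable names ($subject, $found) are only illustrative.

```php
<?php
// Collect every match of the original two-group pattern.
$subject = 'HelloWorldThisIsATest';
preg_match_all('/([A-Za-z])([A-Z])/', $subject, $found);

// $found[0] holds the full matches, in the order they were found.
// "sA" consumes the A, so scanning resumes at "T" and "AT" is never a match.
echo implode(', ', $found[0]), PHP_EOL; // oW, dT, sI, sA
```

There is no AT in the list, which is why no hyphen ends up between a and test.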

Do it in Two Easy Lines

Here's the easy way to do it with regex:

$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));

See the output at the bottom of the php demo:

Output: hello-world-this-is-a-test

How it Works

The regex doesn't match any characters. Rather, it targets positions in the string: every position where a letter is immediately followed by an upper-case letter. To do so, it uses a lookbehind and a lookahead.
The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter.
The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
We just replace these positions with a -, and convert the lot to lowercase.
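One way to make those zero-width positions visible is to split on them instead of replacing them. This is only a sketch; $parts and $acronym are illustrative names. The second call shows what the same pattern does with a run of capitals such as HTML.

```php
<?php
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';

// Splitting on the zero-width matches shows where the hyphens will go.
$parts = preg_split($regex, 'HelloWorldThisIsATest');
echo implode(' | ', $parts), PHP_EOL; // Hello | World | This | Is | A | Test

// Every position between two capitals also qualifies, so acronyms are split up.
$acronym = strtolower(preg_replace($regex, '-', 'HelloWorldHTMLTest'));
echo $acronym, PHP_EOL; // hello-world-h-t-m-l-test
```

If acronyms should stay together, the pattern needs a stricter condition for the capital-to-capital case.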

If you look carefully at this regex101 screen, you can see lines between the words, where the regex matches.


edited Jun 23 '14 at 07:07

answered Jun 23 '14 at 06:52


The reason your regex cannot work is that it would need to allow consecutive matches. Will add an explanation for this in a moment. You **must** use lookarounds as in my answer. – zx81 Jun 23 '14 at 07:04
Your question said `I'm wondering what my problem is`... My answer actually provides an explanation. – zx81 Jun 23 '14 at 07:09
I like your answer too as it provides an explanation as to why my regex didn't work. – rink.attendant.6 Jun 23 '14 at 07:13
your regex will match only two alphabets. – Avinash Raj Jun 23 '14 at 07:14
@AvinashRaj See my demo. :) – zx81 Jun 23 '14 at 07:15
@AvinashRaj Thanks, I know you're a regex scholar—this is the best solution for this kind of problem (zero-width match.) – zx81 Jun 23 '14 at 07:17

Ja͢ck · Answer 2 · 2014-06-23T07:35:52.917

5

I've separated the two regular expressions for simplicity:

preg_replace(array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'), '$1-$2', $string);

It processes the string twice to find:

lowercase -> uppercase boundaries
a run of one or more uppercase letters followed by another uppercase letter

This will have the following behaviour:

ThisIsHTMLTest -> This-Is-HTML-Test
ThisIsATest    -> This-Is-A-Test

Alternatively, use a look-ahead assertion (because the look-ahead does not consume the capital letter it checks, that letter can be reused as the start of the next match):

preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $string);
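For comparison, here is a sketch that runs both variants side by side. The function names hyphenateTwoPass and hyphenateLookahead are made up for this example, and the third input (TestHTML) is an extra case that is not covered in the answer itself.

```php
<?php
// Hypothetical wrapper around the two-pattern version.
function hyphenateTwoPass(string $s): string {
    return preg_replace(['/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'], '$1-$2', $s);
}

// Hypothetical wrapper around the single look-ahead version.
function hyphenateLookahead(string $s): string {
    return preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $s);
}

foreach (['ThisIsHTMLTest', 'ThisIsATest', 'TestHTML'] as $input) {
    printf("%-15s %-18s %s\n", $input, hyphenateTwoPass($input), hyphenateLookahead($input));
}
// ThisIsHTMLTest  This-Is-HTML-Test  This-Is-HTML-Test
// ThisIsATest     This-Is-A-Test     This-Is-A-Test
// TestHTML        Test-HTM-L         Test-HTM-L
```

Both variants agree on these inputs. Note that an acronym at the very end of the string loses its last letter to a hyphen, so that case may need separate handling if it can occur in your data.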

edited Jun 23 '14 at 07:35

answered Jun 23 '14 at 07:15


`+1`, good and clear solution, also solves the use case of multiple upper case abbreviations – ohaal Jun 23 '14 at 07:17
Are we sure there are not cases where due to the character arrangement, we wouldn't need three passes?... Or more...? I have a hunch that this might be the case. – zx81 Jun 23 '14 at 07:24
+1 as it covers use cases that might apply to me in the future, also a good solution – rink.attendant.6 Jun 23 '14 at 07:24
@zx81 Although technically possible, with title cased names the acronyms you're using will be exclusively in uppercase. – Ja͢ck Jun 23 '14 at 07:32
@rink.attendant.6 Btw, I've managed to squeeze it all into one regular expression :) – Ja͢ck Jun 23 '14 at 07:37
Mmm, not so sure... It seems to me that with the first pass, the string gets expanded by an arbitrary number of `-`, which could be odd or even... Therefore, two positions that were missed on pass one might end up with different parities. But my intuition could be wrong, I haven't stopped to work it out on paper. – zx81 Jun 23 '14 at 07:42
@zx81 Let me know when you have; I'd be interested to see if there are cases that "fail" on it. – Ja͢ck Jun 23 '14 at 07:59
Hey Jack, IMO the best way to know for sure would be to run a test on a file with random sequences. Don't have time for it, but I took a Wikipedia article, removed the `[^a-zA-Z]+` and ran the two replacements. As you had guessed, there were quite a number of two or three upper-case ones left, none of the others. Would be worth trying on a larger sample as in normal text there are not that many case switches. See you another time! :) – zx81 Jun 23 '14 at 08:14

ohaal · Answer 3 · 2014-06-23T07:41:53.117

To fix the interesting use case Jack mentioned in your comments (avoid splitting of abbreviations), I went with zx81's route of using lookahead and lookbehinds.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

You can split it in two for the explanation:

First part

(?<=                     look behind to see if there is:
  [a-z]                    any character of: 'a' to 'z'
)                        end of look-behind
(?=                      look ahead to see if there is:
  [A-Z]                    any character of: 'A' to 'Z'
)                        end of look-ahead

(TL;DR: Match between strings of the CamelCase Pattern.)

Second part

(?<=                     look behind to see if there is:
  [A-Z]                    any character of: 'A' to 'Z'
)                        end of look-behind
(?=                      look ahead to see if there is:
  [A-Z]                    any character of: 'A' to 'Z'
  [a-z]                    any character of: 'a' to 'z'
)                        end of look-ahead

(TL;DR: Special case, match between abbreviation and CamelCase pattern)

So your code would then be:

mb_strtolower(preg_replace('/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/', '-', "HelloWorldThisIsATest"));
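As a rough sketch, the same pattern can be wrapped in a small helper and tried on a few inputs. The name camelToHyphenated is hypothetical, and strtolower is used instead of mb_strtolower on the assumption (from the question) that the input is purely alphabetic ASCII.

```php
<?php
// Hypothetical helper: hyphenate at CamelCase boundaries, keeping acronyms whole.
function camelToHyphenated(string $s): string {
    // Boundary 1: lowercase followed by uppercase.
    // Boundary 2: uppercase followed by an uppercase+lowercase pair (end of an acronym).
    $boundary = '/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/';
    // strtolower is enough for ASCII-only input (assumption, see lead-in).
    return strtolower(preg_replace($boundary, '-', $s));
}

echo camelToHyphenated('HelloWorldThisIsATest'), PHP_EOL; // hello-world-this-is-a-test
echo camelToHyphenated('HelloWorldHTMLTest'), PHP_EOL;    // hello-world-html-test
echo camelToHyphenated('TestHTML'), PHP_EOL;              // test-html
```

Because both alternatives are zero-width, nothing is consumed and adjacent boundaries such as the ones around the A in IsATest are all found in a single pass.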

Demo of matches

Demo of code

I was just about to comment asking for an explanation when I saw that the post was edited! Thank you, this is what I am looking for. I'll accept the answer in a few minutes. — rink.attendant.6, Jun 23 '14 at 06:53
They're all good answers but I'd have to accept the other one because it provided an explanation to the first part of my question. — rink.attendant.6, Jun 23 '14 at 07:18
