I'm using capturing groups in regular expressions for the first time and I'm wondering what my problem is, as I assume that the regex engine looks through the string left-to-right.

I'm trying to convert an UpperCamelCase string into a hyphened-lowercase-string, so for example:

HelloWorldThisIsATest => hello-world-this-is-a-test

My precondition is an alphabetic string, so I don't need to worry about numbers or other characters. Here is what I tried:

mb_strtolower(preg_replace('/([A-Za-z])([A-Z])/', '$1-$2', "HelloWorldThisIsATest"));

The result:

hello-world-this-is-atest

This is almost what I want, except there should be a hyphen between a and test. I've already included A-Z in my first capturing group so I would assume that the engine sees AT and hyphenates that.

What am I doing wrong?

rink.attendant.6

3 Answers


The Reason your Regex will Not Work: Overlapping Matches

  • Your regex matches sA in IsATest, allowing you to insert a - between the s and the A
  • In order to insert a - between the A and the T, the regex would have to match AT.
  • This is impossible because the A is already consumed as part of sA. The engine resumes searching after the end of each match, so the matches found by a single preg_replace call cannot overlap.
  • Is all hope lost? No! This is a perfect situation for lookarounds.
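
To make the consumed characters visible, here is a small sketch that lists every match of the question's pattern together with its offset. The variable names are only illustrative; the pattern and the input are the ones from the question.

```php
<?php
// The pattern from the question; each match consumes two letters.
$pattern = '/([A-Za-z])([A-Z])/';
$subject = 'HelloWorldThisIsATest';

// PREG_OFFSET_CAPTURE records where each match starts in the subject.
preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

$found = array();
foreach ($matches[0] as $m) {
    list($text, $offset) = $m;
    $found[] = $text;
    echo $offset, ': ', $text, "\n";
}
// Prints:
// 4: oW
// 9: dT
// 13: sI
// 15: sA
```

AT never shows up: after sA is matched at offset 15, the search resumes at offset 17 (the T), and the A is no longer available.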

Do it in Two Easy Lines

Here's the easy way to do it with regex:

$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
echo strtolower(preg_replace($regex,"-","HelloWorldThisIsATest"));

See the output at the bottom of the php demo:

Output: hello-world-this-is-a-test

Explanation

  • The regex doesn't match any characters. Rather, it targets positions in the string: every position where a letter is immediately followed by an upper-case letter. To do so, it uses a lookbehind and a lookahead.
  • The (?<=[a-zA-Z]) lookbehind asserts that what precedes the current position is a letter
  • The (?=[A-Z]) lookahead asserts that what follows the current position is an upper-case letter.
  • We just replace these positions with a -, and convert the lot to lowercase.
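
If you want to see those positions without regex101, the following sketch prints the offset of each zero-width match and then splits the string at the same positions. It reuses the regex from above; the variable names are just for illustration.

```php
<?php
$regex = '~(?<=[a-zA-Z])(?=[A-Z])~';
$subject = 'HelloWorldThisIsATest';

// Every match is an empty string; only its offset is interesting.
preg_match_all($regex, $subject, $matches, PREG_OFFSET_CAPTURE);
$offsets = array();
foreach ($matches[0] as $m) {
    $offsets[] = $m[1];
}
echo implode(', ', $offsets), "\n";   // 5, 10, 14, 16, 17

// Splitting at the same positions yields the individual words.
$words = preg_split($regex, $subject);
echo implode(' | ', $words), "\n";    // Hello | World | This | Is | A | Test
```

Offsets 16 and 17 are adjacent, which is exactly the A/T case that the capturing-group version could not handle.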

If you look carefully on this regex101 screen, you can see lines between the words, where the regex matches.


zx81

I've separated the two regular expressions for simplicity:

preg_replace(array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'), '$1-$2', $string);

It processes the string twice to find:

  1. lowercase -> uppercase boundaries
  2. a run of one or more uppercase letters followed by another uppercase letter

This will have the following behaviour:

ThisIsHTMLTest -> This-Is-HTML-Test
ThisIsATest    -> This-Is-A-Test
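
To see what each pass contributes, the two patterns can also be applied one at a time. This is only a sketch for illustration; the intermediate variable names are made up, and the final strtolower() is added to get the format asked for in the question.

```php
<?php
$string = 'ThisIsHTMLTest';

// Pass 1: lowercase -> uppercase boundaries.
$afterFirst = preg_replace('/([a-z])([A-Z])/', '$1-$2', $string);
echo $afterFirst, "\n";               // This-Is-HTMLTest

// Pass 2: a run of capitals followed by one more capital.
$afterSecond = preg_replace('/([A-Z]+)([A-Z])/', '$1-$2', $afterFirst);
echo $afterSecond, "\n";              // This-Is-HTML-Test

// Lowercasing gives the hyphenated form from the question.
echo strtolower($afterSecond), "\n";  // this-is-html-test
```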

Alternatively, use a look-ahead assertion. The look-ahead does not consume the capital letter it checks, so that letter is still available to start the next match:

preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $string);
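
A rough way to check that the single pattern behaves like the two-pass version is to compare them on every short string. The sketch below does that for all strings of length 1 to 10 built from 'a' and 'A'; two letters are enough because the patterns only look at letter case. The function names twoPass and onePass and the length limit are assumptions for this illustration, and a brute-force check over short strings is not a proof.

```php
<?php
// Two-pass version (pattern array), as shown above.
function twoPass(string $s): string {
    return preg_replace(array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/'), '$1-$2', $s);
}

// Single-pattern version with the look-ahead.
function onePass(string $s): string {
    return preg_replace('/([A-Z]+|[a-z]+)(?=[A-Z])/', '$1-', $s);
}

// Enumerate every string of length 1..$maxLen over the alphabet {a, A}.
$maxLen = 10;
$mismatches = array();
for ($len = 1; $len <= $maxLen; $len++) {
    for ($bits = 0; $bits < (1 << $len); $bits++) {
        $s = '';
        for ($i = 0; $i < $len; $i++) {
            // Bit $i decides whether position $i is upper or lower case.
            $s .= (($bits >> $i) & 1) ? 'A' : 'a';
        }
        if (twoPass($s) !== onePass($s)) {
            $mismatches[] = $s;
        }
    }
}
echo count($mismatches), " mismatches\n";
```

Any string collected in $mismatches would be an input where the two versions disagree.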
Ja͢ck
  • `+1`, good and clear solution, also solves the use case of multiple upper case abbreviations – ohaal Jun 23 '14 at 07:17
  • Are we sure there are no cases where, due to the character arrangement, we would need three passes?... Or more...? I have a hunch that this might be the case. – zx81 Jun 23 '14 at 07:24
  • +1 as it covers use cases that might apply to me in the future, also a good solution – rink.attendant.6 Jun 23 '14 at 07:24
  • @zx81 Although technically possible, with title cased names the acronyms you're using will be exclusively in uppercase. – Ja͢ck Jun 23 '14 at 07:32
  • @rink.attendant.6 Btw, I've managed to squeeze it all into one regular expression :) – Ja͢ck Jun 23 '14 at 07:37
  • Mmm, not so sure... It seems to me that with the first pass, the string gets expanded by an arbitrary number of `-`, which could be odd or even... Therefore, two positions that were missed on pass one might end up with different parities. But my intuition could be wrong, I haven't stopped to work it out on paper. – zx81 Jun 23 '14 at 07:42
  • @zx81 Let me know when you have; I'd be interested to see if there are cases that "fail" on it. – Ja͢ck Jun 23 '14 at 07:59
  • Hey Jack, IMO the best way to know for sure would be to run a test on a file with random sequences. Don't have time for it, but I took a Wikipedia article, removed the `[^a-zA-Z]+` and ran the two replacements. As you had guessed, there were quite a number of two or three upper-case ones left, none of the others. Would be worth trying on a larger sample as in normal text there are not that many case switches. See you another time! :) – zx81 Jun 23 '14 at 08:14

To fix the interesting use case Jack mentioned in your comments (avoid splitting of abbreviations), I went with zx81's route of using lookaheads and lookbehinds.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

You can split it in two for the explanation:

First part

(?<=                     look behind to see if there is:
  [a-z]                    any character of: 'a' to 'z'
)                        end of look-behind
(?=                      look ahead to see if there is:
  [A-Z]                    any character of: 'A' to 'Z'
)                        end of look-ahead

(TL;DR: Match the position between a lowercase letter and the uppercase letter that starts the next CamelCase word.)

Second part

(?<=                     look behind to see if there is:
  [A-Z]                    any character of: 'A' to 'Z'
)                        end of look-behind
(?=                      look ahead to see if there is:
  [A-Z]                    any character of: 'A' to 'Z'
  [a-z]                    any character of: 'a' to 'z'
)                        end of look-ahead

(TL;DR: Special case, match the position between the end of an abbreviation and the CamelCase word that follows it.)

So your code would then be:

mb_strtolower(preg_replace('/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/', '-', "HelloWorldThisIsATest"));
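
As a quick comparison, here is a sketch that runs this pattern next to the two-pass patterns from the other answer on three inputs. The last input, with the abbreviation at the very end, is my own test case and not one from the thread; the variable names are illustrative only.

```php
<?php
// Pattern from this answer.
$lookaround = '/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/';
// Two-pass patterns from the other answer, for comparison.
$twoPass = array('/([a-z])([A-Z])/', '/([A-Z]+)([A-Z])/');

$results = array();
foreach (array('HelloWorldThisIsATest', 'ThisIsHTMLTest', 'ThisIsHTML') as $input) {
    $a = preg_replace($lookaround, '-', $input);
    $b = preg_replace($twoPass, '$1-$2', $input);
    $results[$input] = array($a, $b);
    echo $input, ' => ', $a, ' | ', $b, "\n";
}
// Prints:
// HelloWorldThisIsATest => Hello-World-This-Is-A-Test | Hello-World-This-Is-A-Test
// ThisIsHTMLTest => This-Is-HTML-Test | This-Is-HTML-Test
// ThisIsHTML => This-Is-HTML | This-Is-HTM-L
```

The second alternative only fires when the next capital is followed by a lowercase letter, so an abbreviation at the end of the string is left in one piece.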

Demo of matches

Demo of code

ohaal
  • I was just about to comment asking for an explanation when I saw that the post was edited! Thank you, this is what I am looking for. I'll accept the answer in a few minutes. – rink.attendant.6 Jun 23 '14 at 06:53
  • They're all good answers but I'd have to accept the other one because it provided an explanation to the first part of my question. – rink.attendant.6 Jun 23 '14 at 07:18