For a long time, whenever I've needed a regular expression, I've standardized on the copyright symbol © as the delimiter: it isn't on the keyboard, and I was sure I would never use it inside a regular expression, unlike ! @ # \ or / (which are sometimes all in use within a single regex).

Code:

$result=preg_match('©<.*?>©', '<something string>');

However, today I needed to use a regular expression with accented characters which included this:

Code:

[a-zA-ZàáâäãåąćęèéêëìíîïłńòóôöõøùúûüÿýżźñçčšžÀÁÂÄÃÅĄĆĘÈÉÊËÌÍÎÏŁŃÒÓÔÖÕØÙÚÛÜŸÝŻŹÑßÇŒÆČŠŽ∂ð \,\.\'-]+

After including this new regex in the PHP file in my IDE (Eclipse PDT), I was prompted to save the PHP file as UTF-8 instead of the default cp1252.

After saving and running the PHP file, every time I used a regex in a preg_match() or preg_replace() function call, it generated a generic PHP warning (Warning: preg_match in file.php on line x) and the regex was not processed.

So--two questions:

1) Is there another symbol that would be good to use as a delimiter that isn't typically found on a keyboard (`~!@#$%^&*()+=[]{};\':",./<>?|\) that I can standardize on and not worry about having to check each and every regex to see if that symbol is actually used somewhere in the expression?

2) Or, is there a way I can use the copyright symbol as the standard delimiter when the file format is UTF-8?

Force Flow
  • As an aside comment: you can write the same character class like this: `[a-zA-ZÀ-ÖØ-öø-ýÿĄ-ćŒČŠŽ∂ð ,.\'-]`. Take a look at this link: http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF – Casimir et Hippolyte Jul 18 '13 at 14:58
  • That's certainly much more compact. Thanks! :) – Force Flow Jul 18 '13 at 15:02

1 Answer

One thing that needs correcting is that if your regular expression and/or input data is encoded in UTF-8 (which in this case it is, since it comes straight from inside a UTF-8 encoded file) you must use the u modifier for your regular expression.
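
A minimal sketch of what the u modifier changes, assuming PHP 7 or later for the "\u{...}" string escape; the helper name isSingleChar is made up for illustration. Without u, the subject is treated as a sequence of bytes, so one accented character encoded in UTF-8 looks like two characters to the engine:

```php
<?php
// Hypothetical helper: does the subject consist of exactly one "character"
// as the regex engine sees it?
function isSingleChar(string $subject, bool $utf8Mode): bool
{
    $pattern = $utf8Mode ? '/^.$/u' : '/^.$/';
    return preg_match($pattern, $subject) === 1;
}

$eAcute = "\u{00E9}"; // é, encoded as the two bytes 0xC3 0xA9 in UTF-8

var_dump(isSingleChar($eAcute, false)); // bool(false): "." consumed only one byte
var_dump(isSingleChar($eAcute, true));  // bool(true): "." consumed the whole code point
```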

Another issue is that the copyright character should not be used as a delimiter in UTF-8 because the PCRE functions consider that the first byte of your pattern encodes your delimiter (this could plausibly be called a bug in PHP).

When you attempt to use the copyright sign as a delimiter in UTF-8, what actually gets saved into the file is the byte sequence 0xC2 0xA9. preg_match looks at the first byte 0xC2 and decides that it is an alphanumeric character because in your current locale that byte corresponds to the character Latin capital letter A with circumflex (see extended ASCII table). Therefore a warning is generated and processing is immediately aborted.
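
To see the bytes involved, here is a small sketch. It assumes PHP 7 or later for the "\u{00A9}" escape, which always yields the UTF-8 encoding regardless of how the source file is saved. The helper name firstByteHex is hypothetical:

```php
<?php
// Hypothetical helper: hex value of the first byte of a pattern string,
// i.e. the byte that gets interpreted as the delimiter.
function firstByteHex(string $pattern): string
{
    return bin2hex($pattern[0]);
}

$utf8Copyright   = "\u{00A9}"; // © in UTF-8: two bytes
$cp1252Copyright = "\xA9";     // © in cp1252 / ISO 8859-1: one byte

echo bin2hex($utf8Copyright), "\n";                                     // c2a9
echo firstByteHex($utf8Copyright . '<.*?>' . $utf8Copyright), "\n";     // c2
echo firstByteHex($cp1252Copyright . '<.*?>' . $cp1252Copyright), "\n"; // a9
```

In the single-byte file the delimiter byte is the whole © character; in the UTF-8 file it is only the first half of it.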

Given these facts, the ideal solution would be to choose an unusual delimiter from inside the ASCII character set because that character would encode to the same byte sequence both in single byte encodings and in UTF-8.

I would not consider printable ASCII characters unusual enough for this purpose, so a good choice would be one of the control characters (ASCII codes 1 to 31). For example, STX (\x02) would fit the bill.

Together with the u regex modifier this means you should write the regex like this:

$result = preg_match("\x02<.*?>\x02u", '<something string>');
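
As a rough sketch of how this might be wrapped up for reuse, the delimiter can be defined once and concatenated around each expression. The constant REGEX_DELIM and the function isValidName are illustrative names, not part of PHP. The character class uses \p{L} (any Unicode letter) as a stand-in for the long explicit list in the question, which is an assumption about what that list was meant to accept:

```php
<?php
// STX control character used as the regex delimiter (illustrative constant).
const REGEX_DELIM = "\x02";

// Hypothetical validator: letters in any script plus space , . ' and -
function isValidName(string $name): bool
{
    $pattern = REGEX_DELIM . '^[\p{L} ,.\'-]+$' . REGEX_DELIM . 'u';
    return preg_match($pattern, $name) === 1;
}

var_dump(isValidName("Zo\u{00EB} O'Neil-\u{0141}uk")); // bool(true)
var_dump(isValidName("R2-D2"));                        // bool(false): digits are not in the class
```

Building the pattern by concatenation keeps the expression itself in a single-quoted string, so only the delimiter needs the double-quoted "\x02" escape.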
Jon
  • Thanks for the explanation and solution. That seems to have done the trick :) – Force Flow Jul 18 '13 at 15:03
  • The phrase "extended ASCII" makes me sad. There is no such thing as "8-bit ASCII"; there are various 8-bit encodings designed to be backwards-compatible with ASCII **and they all have names**. There is no reason to assume the OP's locale is set to ISO 8859-1 (which is what the page you link to shows) - although the character at 0xC2 happens to be the same in CP1252 and ISO 8859-15, which are also likely candidates. – IMSoP Aug 29 '13 at 22:22
  • Looking again, that page is even more muddling: it starts off saying the table is "according to ISO 8859-1" but then mentions "the Microsoft® Windows Latin-1 extended characters" - in other words, the table is actually of [Windows Code Page 1252](https://en.wikipedia.org/wiki/Windows-1252). – IMSoP Aug 29 '13 at 22:34
  • @IMSoP: this is valid criticism, thanks for taking the time to write it. I have made a number of simplifications here (assuming latin1 locale, saying "extended ASCII") because IMHO being 100% technically accurate would draw some attention away from the "interesting" part of the answer, communicating which was the main goal. Your comments are a very good way of dotting these "i"s. – Jon Aug 30 '13 at 13:36