I need a regular expression that matches all Unicode control characters except carriage return (0x0d), line feed (0x0a), and tab (0x09). Currently, my regular expression looks like this:

/\p{C}/u

I just need to define these three exceptions now.

Mathias Bynens
Tower
  • Is this for PHP? To give you the best answer we need to know which regex flavor you're using. Darth Eru's answer will work in PHP, but other flavors would require a different approach. – Alan Moore Jul 05 '09 at 04:30
  • Oh, sorry. Yes, Perl Compatible Regular Expressions which PHP uses. – Tower Jul 05 '09 at 10:41

1 Answer

I think you can use a negative lookahead here, combined with character classes.

/(?![\x{000d}\x{000a}\x{0009}])\p{C}/u

The negative lookahead asserts that the next character is not one of those listed in the character class. Because a lookahead consumes no input, the engine then matches that same character against \p{C}.
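The same lookahead-then-match idea can be sketched in Python as a rough illustration. This is not the PCRE pattern itself: Python's built-in `re` module has no `\p{C}`, so the sketch assumes an explicit range covering only category Cc (U+0000 to U+001F and U+007F to U+009F), which is narrower than `\p{C}`. The names `CONTROL_EXCEPT_WHITESPACE` and `strip_control_chars` are hypothetical.

```python
import re

# Assumption: the explicit range stands in for \p{C}, which `re` does not
# support. It covers category Cc only (U+0000-U+001F, U+007F-U+009F).
# The negative lookahead rejects CR, LF and tab before the class is tried.
CONTROL_EXCEPT_WHITESPACE = re.compile(r"(?![\r\n\t])[\x00-\x1f\x7f-\x9f]")


def strip_control_chars(text: str) -> str:
    """Remove control characters but keep CR, LF and tab (hypothetical helper)."""
    return CONTROL_EXCEPT_WHITESPACE.sub("", text)


sample = "a\x00b\tc\r\nd\x1be\x7f"
print(repr(strip_control_chars(sample)))  # 'ab\tc\r\nde'
```

The lookahead and the class both inspect the same position, so a character is removed only if it passes the exclusion check and then falls inside the control range.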

I used the Perl-style \x{...} syntax for specifying individual Unicode code points.

More discussion on lookarounds here

(Note that this has not been tested, but I think the concept is correct.)

Sean
  • It's a real shame that .NET `Regex` doesn't just have an IsControlCharacter Unicode named block to match `System.Globalization.UnicodeCategory.Control`. It would be great if I could just use that instead of manually specifying all the control characters. – Jez Jun 27 '13 at 08:38