I need some help figuring out the regex for XML character references to control characters, in decimal or hex.

These sequences look like the following:

�





In other words, each one is an ampersand, followed by a pound sign (#), followed by an optional 'x' to denote hexadecimal mode, followed by 1 to 4 decimal (or hexadecimal) digits, followed by a semicolon.

I'm specifically trying to identify those sequences whose numeric value is between decimal 0 and 31 (hexadecimal 0 to 1F), inclusive.

Can anyone suggest a regex for this?

John Saunders
Ken Mason
  • Do you want to accept leading zeroes? Such as your second case? If so, how many leading zeroes are acceptable? ``? – carlpett Sep 15 '11 at 20:34
  • Yes, leading 0's need to be tolerated; there can be at most 4 digits, including leading 0's. Thus, 3, 03, 003, and 0003 are all valid and refer to the same character. – Ken Mason Sep 15 '11 at 20:39

3 Answers

&#(0{0,2}[1-2]\d|0{0,3}\d|0{0,2}3[01]|x0{0,2}[01]?[0-9A-Fa-f]);

It's not the most elegant, but it should work.

Verified in RegexBuddy.

Nicolas Webb

If you use a zero-width lookahead assertion to restrict the number of digits, you can write the rest of the pattern without worrying about the length restriction. Try this:

&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);

Explanation:

(?=x?[0-9A-Fa-f]{1,4};)  #Restricts the numeric portion to at most four digits, including leading zeroes; the trailing semicolon is what makes the upper bound bite.
0*                       #Consumes leading zeroes if there is no x.
[12]?\d                  #Allows decimal numbers 0 - 29, inclusive.
3[01]                    #Allows decimal 30 or 31.
x0*1?[0-9A-Fa-f]         #Allows hexadecimal 0 - 1F, inclusive, regardless of case or leading zeroes.

This pattern allows leading zeroes after the x, but the (?=x?[0-9A-Fa-f]{1,4};) part prevents them from occurring before an x.
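
To illustrate the lookahead idea, here is a minimal Python sketch. Python's `re` module is my assumption, since the question does not name a regex flavor. The sketch assumes the lookahead is anchored on the terminating semicolon, so the digit count really is capped at four. The helper name `is_control_ref` and the sample strings are made up for illustration.

```python
import re

# The lookahead caps the numeric part at four digits (it must reach the ';'),
# so the rest of the pattern only has to get the value range right.
CONTROL_REF = re.compile(
    r"&#(?=x?[0-9A-Fa-f]{1,4};)0*([12]?\d|3[01]|x0*1?[0-9A-Fa-f]);"
)


def is_control_ref(text):
    """Return True if text is exactly one reference to a code point in 0..31."""
    return CONTROL_REF.fullmatch(text) is not None


# Hypothetical sample inputs: valid forms, out-of-range values, and
# malformed forms (five digits, zero before the x).
samples = ["&#0;", "&#0031;", "&#x1F;", "&#x001f;",
           "&#32;", "&#x20;", "&#00031;", "&#0x1F;"]

for s in samples:
    print(s, is_control_ref(s))
```

Using `fullmatch` here checks a single candidate string; for scanning a whole document, `CONTROL_REF.finditer(text)` would be the natural choice.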

Justin Morgan - On strike

I think the following should work:

&#(?:x0{0,2}[01]?[0-9a-fA-F]|0{0,2}(?:[012]?[0-9]|3[01]));

Here is a Rubular:
http://www.rubular.com/r/VEYx25Fdpj
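
One way to sanity-check a pattern like this without an online tester is to compare it against plain arithmetic for every reference with 1 to 4 digits. A rough Python sketch follows; the language choice and the function name `mismatches` are my assumptions, not part of the answer.

```python
import re
from itertools import product

PATTERN = re.compile(
    r"&#(?:x0{0,2}[01]?[0-9a-fA-F]|0{0,2}(?:[012]?[0-9]|3[01]));"
)


def mismatches():
    """List every 1-4 digit reference where the regex and arithmetic disagree."""
    bad = []
    cases = (
        (False, "0123456789"),             # decimal form: &#NNNN;
        (True, "0123456789abcdefABCDEF"),  # hexadecimal form: &#xNNNN;
    )
    for is_hex, alphabet in cases:
        for length in range(1, 5):
            for combo in product(alphabet, repeat=length):
                digits = "".join(combo)
                ref = "&#" + ("x" if is_hex else "") + digits + ";"
                matched = PATTERN.fullmatch(ref) is not None
                # Reference answer: the numeric value must be 0..31.
                in_range = int(digits, 16 if is_hex else 10) <= 31
                if matched != in_range:
                    bad.append(ref)
    return bad


print(mismatches())
```

An empty list means the pattern agrees with the numeric definition for all 1 to 4 digit inputs. Inputs with more than four digits are not covered by this loop and would need a separate check.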

Andrew Clark