2

Why does the pattern

[A-Z][A-z]*

return Ve for the French word Vénus when using NSRegularExpression? I want to match camel-case words, but this result is strange.

Tony Lee
  • 1
    Did you search for an answer before asking? "When specifying a range of characters, such as [a-Z] (i.e. lowercase a to upper-case z), the computer's locale settings determine the contents by the numeric ordering of the character encoding." - https://en.wikipedia.org/wiki/Regular_expression#Character_classes – Onots Jan 08 '15 at 04:04
  • 1
    @Onots: It is not the case for NSRegularExpression. What you quoted is the behavior of POSIX regular expression, which is not applicable here. – nhahtdh Jan 08 '15 at 05:29
  • 1
    @nhahtdh: Thanks for pointing that out. So I googled and learned something today: NSRegularExpression uses the pattern syntax specified by ICU. From the ICU site: "[A-M] Range - match any character from A to M. The characters to include are determined by Unicode code point ordering". – Onots Jan 08 '15 at 06:05
  • 1
    But why does "[A-Z][A-z']*" applied to "Vénus" return "Ve", not "Vé", "Venus", or "Vénus", when using NSRegularExpression? – Tony Lee Jan 08 '15 at 08:04

1 Answer

2

Your regex matches Ve and not Vénus because there are two ways to represent an é in Unicode:

  • Using the precomposed single code point U+00E9, or
  • Using the "decomposed" form: e, followed by the combining acute accent (U+0065 U+0301). Note that the latter is not the actual "standalone" ´ character (U+00B4).

Your string is apparently encoded using the second option. Therefore [A-z] only matches the base letter e, the first half of the combined character. Since the following combining accent (U+0301) doesn't match, the regex stops at this point. You should normalize the string before applying a regex to it.
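To make this concrete, here is a small sketch. It uses Python's `re` and `unicodedata` modules rather than NSRegularExpression, on the assumption that both engines treat these ASCII ranges the same way; the variable names are only illustrative. It also shows a caveat: once normalized, é is U+00E9, which still lies outside A-z, so the sketch swaps in a Unicode-aware letter class (`[^\W\d_]` in Python; in ICU syntax `\p{L}` should play the same role).

```python
import re
import unicodedata

# "Vénus" with é stored as e + combining acute accent (U+0065 U+0301)
decomposed = "Ve\u0301nus"
# NFC normalization folds the pair into the single code point U+00E9
precomposed = unicodedata.normalize("NFC", decomposed)

ascii_pattern = re.compile(r"[A-Z][A-z]*")       # pattern from the question
letter_pattern = re.compile(r"[A-Z][^\W\d_]*")   # assumption: any Unicode letter

# The combining mark U+0301 is not in A-z, so the match stops after "e"
ascii_on_decomposed = ascii_pattern.match(decomposed).group()     # "Ve"
# After NFC, é (U+00E9) is still outside A-z, so the match stops after "V"
ascii_on_precomposed = ascii_pattern.match(precomposed).group()   # "V"
# A Unicode-aware letter class on the normalized string gets the whole word
letters_on_precomposed = letter_pattern.match(precomposed).group()
# Without normalization, the combining mark still ends the match
letters_on_decomposed = letter_pattern.match(decomposed).group()  # "Ve"

print(ascii_on_decomposed, ascii_on_precomposed,
      letters_on_precomposed, letters_on_decomposed)
```

On the Foundation side, NSString's `precomposedStringWithCanonicalMapping` property should give the same NFC form before the string is handed to NSRegularExpression.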

Furthermore, use [A-Za-z] instead of [A-z]. Otherwise, the non-letter characters that sit between Z and a in code point order, such as ^ or ], will also be matched.
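A quick way to see which characters slip in is to list the code points between Z and a. This is again a Python sketch with illustrative names, not NSRegularExpression code, but the range is defined by code point order in both cases.

```python
import re

# Code points strictly between "Z" (U+005A) and "a" (U+0061)
between = [chr(cp) for cp in range(ord("Z") + 1, ord("a"))]

loose = re.compile(r"[A-z]")      # spans the gap between Z and a
strict = re.compile(r"[A-Za-z]")  # two separate letter ranges

# Which of the in-between characters does each class accept?
loose_hits = [ch for ch in between if loose.fullmatch(ch)]
strict_hits = [ch for ch in between if strict.fullmatch(ch)]

print(between)      # the six punctuation characters [ \ ] ^ _ `
print(loose_hits)   # all six are accepted by [A-z]
print(strict_hits)  # none are accepted by [A-Za-z]
```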

Tim Pietzcker