How do you specify a regex character range that will work in European languages other than English?

Question

I'm working with Ruby's regex engine. I need to write a regex that does what this one does:

WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/

but will also work in other European languages besides English. I don't think that the character range [a-z] will cover lowercase letters in German, etc.

Usually Ruby 1.9. I can require Ruby 1.9 in my gem if necessary. — dan, Feb 15 '11 at 14:33
As I understand it, Ruby 1.9 has far better Unicode support than 1.8<, so my guess is that Tim's suggestion should work with 1.9. — Bart Kiers, Feb 15 '11 at 15:01

score 7 · Accepted Answer · edited May 23 '17 at 12:04

WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

should work in Ruby 1.9. \p{Lu} and \p{Ll} are shorthands for Unicode uppercase and lowercase letters. (\w already includes the underscore, so [\w_] is redundant.)

See also this answer - you might need to run Ruby in UTF-8 mode for this to work, and possibly your script must be encoded in UTF-8, too.
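As a quick check of the pattern above, here is a minimal sketch that compares it with the ASCII-only original. The sample words and the constant name OLD_WIKI_WORD are made up for illustration, and it assumes a UTF-8 source file on Ruby 1.9 or later.

```ruby
# encoding: UTF-8

# The pattern from the question (ASCII letter ranges only).
OLD_WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/
# The pattern from this answer (Unicode letter properties).
WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

# Illustrative sample words, not taken from the question.
%w[WikiWord GrößenTabelle wikiword].each do |s|
  old_hit = !!(s =~ OLD_WIKI_WORD)
  new_hit = !!(s =~ WIKI_WORD)
  puts "#{s}: old=#{old_hit} new=#{new_hit}"
end
# WikiWord: old=true new=true
# GrößenTabelle: old=false new=true
# wikiword: old=false new=false
```

The German sample fails with the old pattern because [a-z]+ stops at "ö", while \p{Ll}+ keeps going.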

answered Feb 15 '11 at 14:51

Tim Pietzcker

What does \p and /u at the end do? – dan Feb 15 '11 at 15:55
`\p` is a shorthand for [Unicode character property](http://www.regular-expressions.info/unicode.html#prop). `/u` is the Unicode modifier. It appears to be necessary to tell Ruby that it should interpret the regex in Unicode mode, but I'm not really sure about this, and I haven't found a conclusive piece of documentation on this yet. Most stuff I have is still for Ruby 1.8. – Tim Pietzcker Feb 15 '11 at 16:29
According to Programming Ruby 1.9 when constructing a Regexp from a string "the encoding of the string determines the encoding of the regular expression". So unless otherwise specified the string argument will be encoded using `Encoding.default_external`, which defaults to UTF-8. I think. – zetetic Feb 15 '11 at 18:26
StackOverflow could and probably should use something more like your WɪᴋɪWᴏʀᴅ pattern. I noticed [in this answer](http://stackoverflow.com/questions/5127725/how-could-i-catch-an-unicode-non-character-warning/5128605#5128605) that they weren’t using `\p{Lu}` properly, because it missed my Greek identifier. (Yes, I know it may be in bad taste; I was just playing around.) – tchrist Feb 27 '11 at 11:26
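For anyone wanting to check the `/u` and encoding points from these comments on their own Ruby, here is a small sketch using `Regexp#encoding` and `Regexp#fixed_encoding?`. The variable names and the sample pattern are illustrative, and the output may differ between Ruby versions, so no results are asserted for the lines without a comment.

```ruby
# encoding: UTF-8

plain   = /[a-z]+/   # no modifier
unicode = /[a-z]+/u  # /u fixes the regex encoding to UTF-8

puts plain.encoding
puts plain.fixed_encoding?    # => false
puts unicode.encoding         # => UTF-8
puts unicode.fixed_encoding?  # => true

# Building a Regexp from a string: with non-ASCII characters in the
# pattern, the regex takes over the encoding of the string.
source = "[äöüß]+"
from_string = Regexp.new(source)
puts source.encoding
puts from_string.encoding
```

Here the string literal gets its encoding from the source file, so the magic comment (or the interpreter's default source encoding) is what ends up on the regex.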

score 1 · Answer 2 · answered Feb 15 '11 at 18:46

James Edward Gray II wrote a series of articles on working with Unicode, UTF-8, and Ruby 1.8.7 and 1.9.2. They're important reading.

With Ruby 1.8.7, we could add:

#!/usr/bin/ruby -Ku
require 'jcode'

and get partial UTF-8 support.

With 1.9.2 you can use:

# encoding: UTF-8

as the first line of your source file (or the second, if the first is a shebang line), and that will tell Ruby to treat the source as UTF-8. Gray's recommendation is that we do this with all source we write from now on.

That will not affect external encoding when reading/writing text, only the encoding of the source code.

Ruby 1.9.2 doesn't extend the usual \w, \W and \s character classes to handle UTF-8 or Unicode. As the other comments and answers said, only the POSIX and Unicode character-sets in regex do that.
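A short sketch of that difference, assuming a UTF-8 source file. The sample strings and the constant name WIKI_WORD_UNICODE are hypothetical; the variant pattern simply swaps \w for \p{Word} in the accepted answer's regex and is one possible adaptation, not something taken from the articles.

```ruby
# encoding: UTF-8

word = "Größe"

# \w only covers [a-zA-Z0-9_], so the word is split at "ö" and "ß".
p word.scan(/\w+/)           # => ["Gr", "e"]
# POSIX bracket expressions and Unicode properties cover all letters.
p word.scan(/[[:alpha:]]+/)  # => ["Größe"]
p word.scan(/\p{L}+/)        # => ["Größe"]

# Hypothetical variant of the accepted answer's pattern that uses
# \p{Word} so non-ASCII letters are allowed after the second capital.
WIKI_WORD_UNICODE = /\b(\p{Ll}\p{Word}+\.)?\p{Lu}\p{Ll}+\p{Lu}\p{Word}*\b/u

p "WikiSeiteÄndern"[WIKI_WORD_UNICODE]  # => "WikiSeiteÄndern"
```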
