How to specify Regexp for unicode cyrillic characters in Ruby 1.9

Question

# coding: utf-8
str2 = "asdfМикимаус"
p str2.encoding              # => #<Encoding:UTF-8>
p str2.scan(/\p{Cyrillic}/)  # finds all Cyrillic characters
str2.gsub!(/\w/u, '')        # removes only the Latin characters
puts str2

The question is: why does \w ignore Cyrillic characters?

I have installed the latest Ruby package from http://rubyinstaller.org/. Here is the output of ruby -v:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i386-mingw32]

As far as I know, the Oniguruma regular expression library in 1.9 has full support for Unicode characters.

On Linux (Ruby 1.9) gsub removes all characters: irb(main):006:0> str2.gsub(/\w/u,'') => "" — andrykonchin, Apr 27 '10 at 14:58
@aaz: it shouldn't (see my answer); probably you have an old version? — Marc-André Lafortune, Apr 27 '10 at 17:28
I would rename this question as "How to specify Regexp for unicode characters in Ruby 1.9", since this is not related to win32 nor to (only) cyrillic. — Marc-André Lafortune, Apr 27 '10 at 17:41
You are right. It's a bug in Ruby 1.9.1p0; in Ruby 1.9.1p376 everything works well. — andrykonchin, Apr 27 '10 at 20:20

Marc-André Lafortune · Accepted Answer · 2012-07-16T01:15:13.177

11

This is as specified in the Ruby documentation: \w is equivalent to [a-zA-Z0-9_] and therefore does not match any non-ASCII character, Cyrillic letters included.

You probably want to use [[:alnum:]] instead, which matches all Unicode alphabetic and numeric characters. See also [[:word:]] and [[:alpha:]].
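
To make the difference concrete, here is a minimal sketch comparing \w with the three POSIX bracket classes mentioned above. It assumes Ruby 1.9.2 or later and a UTF-8 source file. The constant MIXED and the helper strip_with are illustrative names only, and the input is the question's string with "_42" appended so the classes can be told apart.

```ruby
# coding: utf-8
# Illustrative input: the question's string plus an underscore and two digits.
MIXED = "asdfМикимаус_42"

# Hypothetical helper: return a copy of text with every match of pattern removed.
def strip_with(text, pattern)
  text.gsub(pattern, '')
end

# \w only covers [a-zA-Z0-9_], so the Cyrillic letters are left behind.
puts strip_with(MIXED, /\w/)

# [[:alnum:]] covers letters and digits of any script, but not the underscore.
puts strip_with(MIXED, /[[:alnum:]]/)

# [[:word:]] additionally covers connector punctuation such as the underscore.
puts strip_with(MIXED, /[[:word:]]/)

# [[:alpha:]] covers letters only, so the underscore and digits remain.
puts strip_with(MIXED, /[[:alpha:]]/)
```

On a recent Ruby this should print "Микимаус", "_", an empty line, and "_42", in that order. Older 1.9.1 patch levels may behave differently, as the comments under the question suggest.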

edited Jul 16 '12 at 01:15

answered Apr 27 '10 at 17:26


BTW, we can thank Run Paint Run Run for writing this documentation. – Marc-André Lafortune Apr 27 '10 at 17:51
