Scanning for Unicode Numbers in a string with \d

Question

According to the Oniguruma documentation, the \d character type matches:

decimal digit char
Unicode: General_Category -- Decimal_Number

However, scanning for \d in a string with all the Decimal_Number characters results in only latin 0-9 digits being matched:

#encoding: utf-8
require 'open-uri'
html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read
digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*')

puts digits.encoding, digits
#=> UTF-8
#=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨…

p RUBY_DESCRIPTION, digits.scan(/\d/)
#=> "ruby 1.9.2p180 (2011-02-18) [i386-mingw32]"
#=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

Am I misreading the documentation? Why doesn't \d match other Unicode numerals, and/or is there a way to make it do so?

Yes, Ruby regexes have lots of super-annoying problems with Unicode. See slides 15–20 on Ruby from my recent OSCON [Unicode Support Shootout](http://training.perl.com/OSCON2011/index.html) talk, especially the last one. On the other hand, it does do full casefolding, the only engine apart from Perl that does so. But since Ruby doesn’t meet [UTS#18’s Level 1 conformance requirements](http://unicode.org/reports/tr18/) for the most basic possible Unicode regex functionality, you’re pretty much out of luck. You’d need to use ICU or Perl for real Unicode work I am afraid. — tchrist, Aug 09 '11 at 19:16

Phrogz · Accepted Answer · 2011-08-10T02:42:04.170

Noted by Brian Candler on ruby-talk:

\w only matches ASCII letters and digits, while [[:alpha:]] matches the full set of Unicode letters.
\d only matches ASCII digits, while [[:digit:]] matches the full set of Unicode numbers.

The behavior is thus 'consistent', and we have a simple workaround for Unicode numbers. Reading up on \w in the same Oniguruma doc we see the text:

\w  word character  
    Not Unicode: alphanumeric, "_" and multibyte char.  
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

In light of the real behavior of Ruby and the "Not Unicode" text above, it would appear that the documentation is describing two modes—a Unicode mode and a Not Unicode mode—and that Ruby is operating in the Not Unicode mode.

This would explain why \d does not match the full Unicode set: although the Oniguruma documentation fails to describe exactly what is matched when in Not Unicode mode, we now know that the behavior documented as "Unicode" is not to be expected.

p "abç".scan(/\w/), "abç".scan(/[[:alpha:]]/)
#=> ["a", "b"]
#=> ["a", "b", "\u00E7"]
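
The same contrast can be checked for digits. This is a minimal sketch, assuming a UTF-8 string; the sample characters and the variable names `sample`, `ascii_only`, and `all_digits` are illustrative choices, not from the original post:

```ruby
# Sample: ASCII "7", ARABIC-INDIC DIGIT THREE (U+0663),
# and DEVANAGARI DIGIT THREE (U+0969).
sample = "7\u0663\u0969"

# \d is restricted to ASCII digits by default.
ascii_only = sample.scan(/\d/)

# The POSIX bracket expression covers the Unicode decimal digits.
all_digits = sample.scan(/[[:digit:]]/)

p ascii_only        #=> ["7"]
p all_digits.length #=> 3
```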

It is left as an exercise to the reader to discover how (if at all) to enable Unicode mode in Ruby regexps, as the /u flag (e.g. /\w/u) does not do it. (Perhaps Ruby must be recompiled with a special flag for Oniguruma.)

Update: It would appear that the Oniguruma document I have linked to is not accurate for Ruby 1.9. See this ticket discussion, including these posts:

[Yui NARUSE] "RE.txt is for original Oniguruma, not for Ruby 1.9's regexp. We may need our own document."
[Matz] "Our Oniguruma is forked one. The original Oniguruma found in geocities.jp has not been changed."

Better Reference: Here is official documentation on Ruby 1.9's regexp syntax:
https://github.com/ruby/ruby/blob/trunk/doc/re.rdoc

edited Aug 10 '11 at 02:42

answered Aug 09 '11 at 22:30

Phrogz

`\w` does not match Latin letters or digits. It matches ASCII ones. Latin != ASCII. Shame shame. And I dare you to try to emulate a UTS#18 version of `\w` using Ruby’s paltry properties: it cannot be done. Plus all the POSIX properties are out of spec. Some even lie. For example, alpha **is not supposed to mean the letters!!** Therefore, it sucks. – tchrist Aug 10 '11 at 01:53

@tchrist Thank you for the correct terminology; I'll edit my answer to match. I humbly suggest that whether or not Ruby's handling of Unicode 'sucks' is not an appropriate topic for Stack Overflow discussion. Further, assuming that you downvoted my answer based on the incorrect wording, I request that you read the question again and base your voting on whether or not what I've written properly answers the question. – Phrogz Aug 10 '11 at 02:40
The thing is that it doesn't match the standard. It uses things the standard says do one thing, but it uses them in another way. This is very misleading. For example, Ⓜ is an (other_)alphabetic character, and a word character, but Ruby can’t match it that way because it is a symbol. This is an error. See UTS#18. There are worse problems, too. – tchrist Aug 10 '11 at 02:41

@tchrist I'm not claiming you're wrong, simply that what you are discussing is entirely unrelated to this question. It would appear that you're using this question to vent your frustration at what you perceive to be Ruby's poor handling of Unicode (limited or not to regular expressions). And finally: `"Ⓜ".scan(/[[:alpha:]]/)` works as I believe you are saying it should. – Phrogz Aug 10 '11 at 02:46
You’re right: spot checks show more things working than last I looked. Not sure what happened on the previous runs. Strange. – tchrist Aug 10 '11 at 03:02

score 1 · Answer 2 · answered Aug 09 '11 at 15:54

Try the Unicode character class \p{N} instead. That matches all Unicode digits. No idea why \d isn't working.

answered Aug 09 '11 at 15:54

Tim Pietzcker

Your `\pN` (really, `\p{GC=Number}`) matches more than all digits: it matches all Numbers, including non-digits. Just the digits are `\p{Nd}` (really, `\p{GC=Decimal_Number}` or `\p{Numeric_Type=Decimal}` in full Unicode). All numbers in Unicode have numeric values, but not all are digits. Digits are base-10 bigendian ones you can build larger numbers out of. DIGIT SEVEN, SUPERSCRIPT SEVEN, & ROMAN NUMERAL SEVEN are all `\pN` & `\p{NV=7}`. However, their general categories are `Decimal_Number`, `Other_Number`, `Letter_Number`. **Alas, Ruby supports at most *only* 3 Unicode properties!** – tchrist Aug 09 '11 at 19:07

Correction: Ruby does not support three Unicode properties; it supports only two Unicode properties, the `General_Category` and the `Script` property. Even those don’t work on all strings because of the Ruby strings-keep-their-encoding design-bug. It has a few bogus properties that you can’t trust because they have the same names as those in the Unicode Standard, but which behave contrary to the same. One of those is the `alphabetic` property, which is broken in Ruby. Sad sad sad. – tchrist Aug 09 '11 at 19:19
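
The distinction drawn in the comments above between `\p{N}` and `\p{Nd}` can be sketched with the three "sevens" mentioned there. This is a hedged illustration, not part of the original thread; the variable names are made up for the example, and the code points (U+2077, U+2166) are the ones I take to be SUPERSCRIPT SEVEN and ROMAN NUMERAL SEVEN:

```ruby
# DIGIT SEVEN, SUPERSCRIPT SEVEN (U+2077), ROMAN NUMERAL SEVEN (U+2166)
sevens = "7\u2077\u2166"

any_number     = sevens.scan(/\p{N}/)   # General_Category: Number (all three)
decimal_digits = sevens.scan(/\p{Nd}/)  # Decimal_Number only
other_numbers  = sevens.scan(/\p{No}/)  # Other_Number (the superscript)
letter_numbers = sevens.scan(/\p{Nl}/)  # Letter_Number (the Roman numeral)

p any_number.length  #=> 3
p decimal_digits     #=> ["7"]
```

So for the original question, where only decimal digits are wanted, `\p{Nd}` looks like the closer fit than `\p{N}`.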

J-_-L · Answer 3 · 2019-11-11T14:32:54.480

\d matches only ASCII digits by default. You can turn on Unicode matching inside a regex using the (counter-intuitive) (?u) syntax:

"٣".match(/(?u)\d/) # => #<MatchData "٣">

Alternatively, you can use "posix" or "unicode property" style in your regex, which don't require you to manually turn on Unicode matching:

/[[:digit:]]/ # posix style
/\p{Nd}/ # unicode property/category style

You can find more detailed information about how to do advanced matching for Unicode characters in Ruby in this blog post: https://idiosyncratic-ruby.com/30-regex-with-class.html
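
As a rough check that the three spellings agree, here is a small sketch. It assumes Ruby 2.0 or later (where the `(?u)` inline option is available); the sample string and variable names are hypothetical:

```ruby
# ASCII 7, ARABIC-INDIC DIGIT THREE, DEVANAGARI DIGIT THREE,
# plus two non-digit numbers: SUPERSCRIPT SEVEN and ROMAN NUMERAL SEVEN.
sample = "7\u0663\u0969\u2077\u2166"

inline_flag = sample.scan(/(?u)\d/)       # \d with Unicode matching switched on
posix       = sample.scan(/[[:digit:]]/)  # POSIX bracket style
property    = sample.scan(/\p{Nd}/)       # Unicode general category style

# All three should pick out the same decimal digits and skip the last two characters.
p inline_flag == posix && posix == property
p property.length
```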
