Ruby: how to check if a UTF-8 string contains only letters and numbers?

Question

I have a UTF-8 string, which might be in any language.

How do I check that it contains only alphanumeric characters?

I could not find such a method in the UnicodeUtils Ruby gem.

Examples:

ėččę91 - valid
$120D - invalid

Which version of Ruby? 1.8 has limited multi-byte capability. 1.9+ has it in spades. — the Tin Man, Jan 31 '11 at 22:50

score 3 · Accepted Answer · answered Jan 31 '11 at 23:46


You can use the POSIX notation for alpha-numerics:

#!/usr/bin/env ruby -w
# encoding: UTF-8

puts RUBY_VERSION

valid = "ėččę91"
invalid = "$120D"

puts valid[/[[:alnum:]]+/]
puts invalid[/[^[:alnum:]]+/]

Which outputs:

1.9.2
ėččę91
$
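The snippet above extracts the matching (or non-matching) run rather than giving a yes/no answer. A minimal sketch of a whole-string check built on the same POSIX class follows. The method name `alnum_only?` is hypothetical, and `match?` assumes Ruby 2.4 or later (on 1.9, `!!(str =~ pattern)` would do the same job).

```ruby
# Hypothetical helper, not part of the original answer.
# \A and \z anchor the pattern to the whole string, so every
# character must belong to [[:alnum:]].
def alnum_only?(str)
  str.match?(/\A[[:alnum:]]+\z/)
end

puts alnum_only?("ėččę91")  # true  (the question's valid example)
puts alnum_only?("$120D")   # false (the question's invalid example)
```

Because of the `+` quantifier, an empty string is reported as invalid; swap in `*` if an empty string should pass.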

answered Jan 31 '11 at 23:46

the Tin Man


1

Is that the same as `[\p{Alphabetic}\p{Number}]`? – tchrist Feb 01 '11 at 00:13

Michael Papile · Answer 2 · 2011-02-01T00:16:44.563

1

In Ruby regular expressions, \p{L} matches any letter (in any script) and \p{N} matches any number.

so if s represents your string:

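 s.match(/\A[\p{L}\p{N}]+\z/)

This returns a MatchData object when s consists solely of letters and numbers, and nil otherwise. The \A and \z anchors tie the match to the whole string; ^ and $ would anchor only to a single line.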

edited Feb 01 '11 at 00:16

answered Jan 31 '11 at 23:47

Michael Papile


1

You have `\d` but `\d` is **not numbers!** `\pN` is numbers, or rubyspeak, the `\p{N}` verbosity. `\d` is only `\p{Decimal_Number}` a.k.a. `\p{Numeric_Type=Decimal}` Not that Ruby bothers to support all the Unicode properties like that, but anyway 1.9 is better than 1.8. Still a long ways to go, though. – tchrist Feb 01 '11 at 00:08
Thanks for that, I updated answer to be more precise with numbers. – Michael Papile Feb 01 '11 at 00:17
1

**Technically speaking,** there are just over 1,000 code points which are of type `\p{Alphabetic}` but which are not `\p{Letter}`. This especially matters if you haven’t normalized to NFC form, or have decomposed to NFD or NFKD, but it can in fact occur even in NFC forms, too. Just depends. – tchrist Feb 01 '11 at 00:23
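The distinctions raised in these comments can be checked directly. The sketch below uses illustrative code points chosen for this example (they do not come from the thread) and assumes Ruby 2.4+ for `match?`. Note that in Ruby specifically, `\d` is narrower than described above: it matches only ASCII 0-9.

```ruby
# Illustrative strings, written with \u escapes so the code points are explicit.
arabic_three = "\u0663"        # ARABIC-INDIC DIGIT THREE (category Nd)
roman_eight  = "\u2167"        # ROMAN NUMERAL EIGHT (category Nl)
ka_aa        = "\u0915\u093E"  # DEVANAGARI KA + VOWEL SIGN AA (a spacing mark, Mc)

# Ruby's \d is ASCII-only; \p{N} covers every Unicode number category.
puts arabic_three.match?(/\A\d\z/)     # false
puts arabic_three.match?(/\A\p{N}\z/)  # true
puts roman_eight.match?(/\A\p{N}\z/)   # true

# The vowel sign has the Alphabetic property but is not a Letter.
puts ka_aa.match?(/\A\p{L}+\z/)           # false
puts ka_aa.match?(/\A\p{Alphabetic}+\z/)  # true
```

Which property to use depends on whether text with such marks should count as alphanumeric for the application at hand.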

score 1 · Answer 3 · answered Feb 01 '11 at 00:19

The pattern for one alphanumeric code point is

/[\p{Alphabetic}\p{Number}]/

From there it’s easy to extrapolate something like this to test whether a string contains at least one non-alphanumeric code point:

/[^\p{Alphabetic}\p{Number}]/

or this to test whether every code point on a line is alphanumeric:

 /^[\p{Alphabetic}\p{Number}]+$/

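or this, when the whole string rather than a single line must match (in Ruby, ^ and $ anchor at line boundaries, while \A and \z anchor at the start and end of the string):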

/\A[\p{Alphabetic}\p{Number}]+\z/

Pick the one that best suits your needs.
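The difference between the last two patterns only shows up on multi-line input. A small sketch, using a made-up two-line string and assuming Ruby 2.4+ for `match?`:

```ruby
# The two "all positive" patterns from above.
pattern_line   = /^[\p{Alphabetic}\p{Number}]+$/   # anchored per line
pattern_string = /\A[\p{Alphabetic}\p{Number}]+\z/ # anchored to the whole string

# Hypothetical input: a valid first line followed by a line of punctuation.
multiline = "abc1\n!!!"

puts multiline.match?(pattern_line)    # true:  the first line alone satisfies ^...$
puts multiline.match?(pattern_string)  # false: the whole string must be alphanumeric
puts "ėččę91".match?(pattern_string)   # true
```

For validating user input, the `\A...\z` form is usually the safer choice, since a single valid line cannot mask invalid content elsewhere in the string.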
