
I have a UTF-8 string, which might be in any language.

How do I check that it contains only alphanumeric characters?

I could not find such a method in the UnicodeUtils Ruby gem.

Examples:

  1. ėččę91 - valid
  2. $120D - invalid
tchrist
krn

3 Answers


You can use the POSIX notation for alpha-numerics:

#!/usr/bin/env ruby -w
# encoding: UTF-8

puts RUBY_VERSION

valid = "ėččę91"
invalid = "$120D"

puts valid[/[[:alnum:]]+/]
puts invalid[/[^[:alnum:]]+/]

Which outputs:

1.9.2
ėččę91
$
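
The snippet above extracts matching runs rather than validating the whole string. A minimal sketch of a yes/no check built on the same POSIX class might look like this; `alnum_only?` is a hypothetical helper name, not something from Ruby or the UnicodeUtils gem, and it assumes the string is UTF-8 encoded:

```ruby
# Hypothetical helper: true only when every character is alphanumeric.
def alnum_only?(str)
  # \A and \z anchor the pattern to the whole string;
  # + requires at least one character, so "" is rejected.
  !(str =~ /\A[[:alnum:]]+\z/).nil?
end

puts alnum_only?("ėččę91")  # valid example from the question
puts alnum_only?("$120D")   # invalid example from the question
```

This should print true for the first string and false for the second.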
the Tin Man

In Ruby regexes, \p{L} matches any letter in any script, and \p{N} matches any kind of numeric character.

so if s represents your string:

 s.match /^[\p{L}\p{N}]+$/

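This returns a match only when the string (or, for multi-line input, at least one of its lines) consists entirely of letters and numbers; otherwise it returns nil. Use \A and \z in place of ^ and $ to anchor the match to the whole string.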
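
In Ruby, ^ and $ anchor at line boundaries, which matters if the input can contain newlines. A small sketch of the difference, using a made-up two-line string and illustrative variable names:

```ruby
# Made-up input: a clean first line followed by a line of punctuation.
s = "abc\n!!!"

line_anchored   = /^[\p{L}\p{N}]+$/    # ^ and $ match at each line
string_anchored = /\A[\p{L}\p{N}]+\z/  # \A and \z match only at the string ends

line_ok   = !(s =~ line_anchored).nil?
string_ok = !(s =~ string_anchored).nil?

puts line_ok    # true: the first line "abc" satisfies the pattern
puts string_ok  # false: the string as a whole does not
```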

Michael Papile
  • You have `\d` but `\d` is **not numbers!** `\pN` is numbers, or rubyspeak, the `\p{N}` verbosity. `\d` is only `\p{Decimal_Number}` a.k.a. `\p{Numeric_Type=Decimal}`. Not that Ruby bothers to support all the Unicode properties like that, but anyway 1.9 is better than 1.8. Still a long ways to go, though. – tchrist Feb 01 '11 at 00:08
  • Thanks for that, I updated answer to be more precise with numbers. – Michael Papile Feb 01 '11 at 00:17
  • **Technically speaking,** there are just over 1,000 code points which are of type `\p{Alphabetic}` but which are not `\p{Letter}`. This especially matters if you haven’t normalized to NFC form, or have decomposed to NFD or NFKD, but it can occur even in NFC form, too. Just depends. – tchrist Feb 01 '11 at 00:23

The pattern for one alphanumeric code point is

/[\p{Alphabetic}\p{Number}]/

From there it’s easy to extrapolate something like this to test whether the string contains a non-alphanumeric character:

/[^\p{Alphabetic}\p{Number}]/

or this to test whether every character is alphanumeric:

 /^[\p{Alphabetic}\p{Number}]+$/

or sometimes this, when the match must span the whole string rather than a single line:

/\A[\p{Alphabetic}\p{Number}]+\z/

Pick the one that best suits your needs.
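
As a rough sketch, the patterns above could be wrapped like this. The constant and method names are hypothetical, and the Devanagari string is only an illustration of a code point that is Alphabetic without being a Letter:

```ruby
# Hypothetical names wrapping the patterns from this answer.
NON_ALNUM = /[^\p{Alphabetic}\p{Number}]/
ALL_ALNUM = /\A[\p{Alphabetic}\p{Number}]+\z/

# True if the string contains at least one non-alphanumeric code point.
def has_non_alnum?(str)
  !(str =~ NON_ALNUM).nil?
end

# True if the string is non-empty and every code point is alphanumeric.
def all_alnum?(str)
  !(str =~ ALL_ALNUM).nil?
end

puts all_alnum?("ėččę91")     # valid example from the question
puts all_alnum?("$120D")      # invalid example from the question
puts has_non_alnum?("$120D")

# DEVANAGARI LETTER KA followed by VOWEL SIGN AA. The vowel sign is a
# spacing mark (category Mc), not a Letter, but it has the Alphabetic property.
ka = "\u0915\u093E"
puts all_alnum?(ka)                      # accepted by \p{Alphabetic}
puts !(ka =~ /\A[\p{L}\p{N}]+\z/).nil?   # rejected by \p{L} alone
```

The last two lines show why \p{Alphabetic} can be the safer choice for text in scripts that use combining vowel signs.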

tchrist