Regular expression for range checking for non ASCII characters in Ruby

Question

I want to split out the characters in a UTF-8 string that are not ASCII. I wrote this line of code:

result = string.scan(/[\x0600-\x06ff]/)

It does not work, though; it raises this error:

"empty range in char class : /[\x0600-\x06ff]/".

I just want to check whether a character falls within the range given in the regular expression and, if so, split it out.

I can't use it because I have to **check whether a character falls in the given range or not**. If yes, then split it. – ZeeAzmat Mar 03 '14 at 03:08
My task is not to check every non-ASCII character. The range I want to check is the one given in the question, 0600-06ff. – ZeeAzmat Mar 03 '14 at 03:09

score 6 · Answer 1 · answered Mar 03 '14 at 03:07

Ruby doesn't support Unicode tokens in its implementation of regex (or my RegexBuddy is telling me lies).

If I try \u0000 I get an error that says it is not supported.
If your version of Ruby does support it, the range is [\u0000-\uFFFF]

You could try using the POSIX class [^[:ascii:]] to match everything non-ASCII.
You could also try [^\x00-\xFF] to match everything which does not have a decimal value of 0-255.

answered Mar 03 '14 at 03:07

Vasili Syrakis

`\uHHHH` is used to specify Unicode by hex value but +1 for `:ascii:`. – mu is too short Mar 03 '14 at 03:08

score 3 · Answer 2 · edited May 23 '17 at 11:57

Your problem is that you're using \x incorrectly. \xHH specifies the byte that is HH in hexadecimal. That means that \x0600 is actually \x06, 0, and 0, and your whole character class looks like this (with spaces to separate the parts):

\x06 0 0-\x06 f f

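The only range in there is 0-\x06, which runs backwards from the character 0 (byte 0x30) to byte 0x06 and is therefore empty; that's where the error comes from, and it's not what you want. If you want to specify Unicode by hex value, then you want to use \u: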

/[\u0600-\u06ff]/
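As a rough illustration of the difference between the two escapes, here is a minimal sketch assuming Ruby 1.9 or later with UTF-8 source encoding. The helper names `escape_bytes` and `arabic_block_chars` are made up for this example and are not part of the question's code.

```ruby
# \xHH consumes at most two hex digits, so "\x0600" is the byte 0x06
# followed by the two literal characters "0" and "0".
def escape_bytes
  "\x0600".bytes.to_a
end

# Hypothetical helper: collect every character in U+0600..U+06FF.
def arabic_block_chars(text)
  text.scan(/[\u0600-\u06ff]/)
end

p escape_bytes  # => [6, 48, 48]

# Four Arabic-script letters surrounded by ASCII text; only the four
# letters are returned.
p arabic_block_chars("abc \u0627\u0631\u062f\u0648 xyz")
```

The `scan` call returns an array of the matching characters, so an empty array means nothing in the string fell inside the range.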

Furthermore, your range misses a lot of non-ASCII values (such as 'µ', which is \u00b5). You'd be better off using Vasili's /[^[:ascii:]]/ POSIX named character class or /[^\p{ASCII}]/.
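To make the difference concrete, the following sketch compares the three patterns on a short sample string. It assumes Ruby 1.9 or later and UTF-8 strings; the method names are hypothetical and chosen only for this example.

```ruby
# Hypothetical helpers, one per pattern discussed above.
def non_ascii_posix(text)
  text.scan(/[^[:ascii:]]/)     # POSIX bracket class, negated
end

def non_ascii_property(text)
  text.scan(/[^\p{ASCII}]/)     # property syntax, negated
end

def in_0600_block(text)
  text.scan(/[\u0600-\u06ff]/)  # only U+0600..U+06FF
end

# 'a', MICRO SIGN (U+00B5), ARABIC LETTER ALEF (U+0627)
sample = "a\u00b5\u0627"

p non_ascii_posix(sample).length     # => 2
p non_ascii_property(sample).length  # => 2
p in_0600_block(sample).length       # => 1
```

Both non-ASCII patterns pick up the micro sign as well as the Arabic letter, while the explicit range picks up only the latter. Which behaviour is wanted depends on whether the goal is "anything non-ASCII" or "this block only".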

answered Mar 03 '14 at 03:07

mu is too short

I want this to count the Unicode characters, but the regular expression isn't working; it always goes to the else part. Can you spot the problem? `if unicode.match(/[\u0600-\u06ff]/) unicodeChars += 1 else asciiChars += 1 end` – ZeeAzmat Mar 03 '14 at 03:29
What does `unicode` look like? Why not use Vasili's `[^[:ascii:]]` regex instead? Your 0x0600 to 0x06ff range misses a fair bit (such as `'µ'`, which is `\u00b5`). – mu is too short Mar 03 '14 at 03:35
In the code above, **unicode holds a 4-digit hex number**. The reason I am not using [^[:ascii:]] is that I have to check how much of the document is in **Urdu** and how much of it is in English, so I have to compare each character against the given range. There are a lot of other Unicode characters, and if I use [^[:ascii:]] as the regular expression, those non-Urdu characters will also show up in the count. The Unicode range for **Urdu** characters is 0600-06ff. – ZeeAzmat Mar 03 '14 at 03:46
I still don't know what exactly your `unicode` string looks like so I can't say any more. `"\u0600".match(/[\u0600-\u06ff]/)` works as expected for me. – mu is too short Mar 03 '14 at 04:28
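For the counting task described in these comments, a minimal sketch might look like the following. It assumes Ruby 1.9 or later and UTF-8 input. The names `URDU_RANGE`, `classify_chars`, and `hex_in_range?` are invented for illustration. The second helper covers one possible reading of "unicode has 4 digit hex number", namely that the variable holds the hex digits as text rather than the character itself; that reading is a guess, since the actual contents of `unicode` were never shown.

```ruby
# Range taken from the question; the constant name is hypothetical.
URDU_RANGE = /[\u0600-\u06ff]/

# Hypothetical helper: count characters inside the range, plain ASCII
# characters, and everything else, one character at a time.
def classify_chars(text)
  counts = { :in_range => 0, :ascii => 0, :other => 0 }
  text.each_char do |ch|
    if ch.match(URDU_RANGE)
      counts[:in_range] += 1
    elsif ch.ascii_only?
      counts[:ascii] += 1
    else
      counts[:other] += 1
    end
  end
  counts
end

# Assumption: the code point arrives as hex text such as "0627".
# The regex would only see ASCII digits in that case, so compare the
# numeric value against the range instead.
def hex_in_range?(hex_text)
  (0x0600..0x06ff).cover?(hex_text.hex)
end

p classify_chars("ab\u0627\u00b5")  # one in range, two ASCII, one other
p hex_in_range?("0627")             # => true
p hex_in_range?("00b5")             # => false
```

Keeping a separate `:other` bucket means characters such as 'µ' are not silently counted as either Urdu or English.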
