1

I want to split out the characters that are not ASCII, i.e. the non-ASCII characters in a UTF-8 string. I wrote this line of code:

result = string.scan(/[\x0600-\x06ff]/)

Somehow it is not working, and it gives this error:

"empty range in char class : /[\x0600-\x06ff]/".

I just want to check whether a character falls within the range in the regular expression. If so, split it out.

Sheridan
ZeeAzmat
  • I can't use it because I have to **check whether a character falls in the given range or not**. If yes, then split it. – ZeeAzmat Mar 03 '14 at 03:08
  • My task is not to check every non-ASCII character. The range I want to check is given in the question: 0600-06ff – ZeeAzmat Mar 03 '14 at 03:09

2 Answers

6

Ruby doesn't support Unicode tokens in its implementation of regex (or my RegexBuddy is telling me lies).

If I try \u0000 I get an error that says it is not supported.
If your version of Ruby does support it, the range is [\u0000-\uFFFF]

You could try using the POSIX class [^[:ascii:]] to match everything non-ASCII.
You could also try [^\x00-\xFF] to match everything which does not have a decimal value of 0-255.

Vasili Syrakis
3

Your problem is that you're using \x incorrectly. \xHH specifies the byte that is HH in hexadecimal. That means that \x0600 is actually \x06, 0, and 0 and your whole character class looks like (with spaces to separate the parts):

\x06 0 0-\x06 f f

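The only range in there is 0-\x06, and since the character 0 (byte 0x30) comes after \x06, that range is empty, which is what the error message is telling you. That's not what you want. If you want to specify Unicode characters by hex value, use \u instead: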

/[\u0600-\u06ff]/
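As a rough sketch of the difference (assuming Ruby 1.9 or later; the sample string `text` is made up for illustration and written with \u escapes so it does not depend on the source file's encoding):

```ruby
# \xHH consumes at most two hex digits, so "\x0600" is three bytes:
# the byte 0x06 followed by the two ASCII characters "0" and "0".
bytes = "\x0600".bytes.to_a
puts bytes.inspect

# Hypothetical sample: English letters around the Urdu word for "Urdu".
text = "abc \u0627\u0631\u062F\u0648 xyz"

# \uHHHH names a whole code point, so this range covers U+0600..U+06FF.
result = text.scan(/[\u0600-\u06ff]/)

# scan returns each matching character as its own one-character string.
puts result.length
```

Here `scan` should pick out only the four characters from the 0600-06ff block and skip the ASCII letters and spaces.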

Furthermore, your range misses a lot of non-ASCII values (such as 'µ', which is \u00b5). If the goal is to match everything non-ASCII, you'd be better off using Vasili's /[^[:ascii:]]/ POSIX named character class or /[^\p{ASCII}]/.
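To illustrate how the three patterns differ, here is a small sketch, again assuming Ruby 1.9 or later and an invented sample string:

```ruby
# Hypothetical sample: ASCII text, the micro sign (U+00B5), and four
# characters from the 0600-06ff block.
text = "5 \u00b5m \u0627\u0631\u062F\u0648"

# Both of these match any character outside ASCII.
non_ascii_posix = text.scan(/[^[:ascii:]]/)
non_ascii_prop  = text.scan(/[^\p{ASCII}]/)

# This one matches only characters in U+0600..U+06FF, so it skips the
# micro sign.
block_only = text.scan(/[\u0600-\u06ff]/)

puts non_ascii_posix.length
puts block_only.length
```

The non-ASCII classes pick up the micro sign as well, while the explicit range does not. Which behaviour is right depends on whether you want "anything non-ASCII" or "only this block".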

Community
mu is too short
  • I want to use this to count the Unicode characters, but this regular expression isn't working; it always goes to the else part. Can you spot the problem? `if unicode.match(/[\u0600-\u06ff]/) then unicodeChars += 1 else asciiChars += 1 end` – ZeeAzmat Mar 03 '14 at 03:29
  • What does `unicode` look like? Why not use Vasili's `[^[:ascii:]]` regex instead? Your 0x0600 to 0x06ff range misses a fair bit (such as `'µ'`, which is `\u00b5`). – mu is too short Mar 03 '14 at 03:35
  • 1
    In the above code **unicode holds a 4-digit hex number**. The reason I am not using [^[:ascii:]] is that I have to check how much of the document is in **Urdu** and how much of it is in English, so I have to compare each character against the given range. There are a lot of other Unicode characters, and if I use [^[:ascii:]] as the regular expression, those non-Urdu characters will also end up in the count. The **Urdu** characters' Unicode range is 0600-06ff. – ZeeAzmat Mar 03 '14 at 03:46
  • I still don't know what exactly your `unicode` string looks like so I can't say any more. `"\u0600".match(/[\u0600-\u06ff]/)` works as expected for me. – mu is too short Mar 03 '14 at 04:28
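One possible reading of the comments above is that `unicode` holds the hex digits as text (for example "0627") rather than the character itself. That is only an assumption, since the actual value is never shown. Under that assumption, a sketch of both cases might look like this (`count_chars`, `URDU_RANGE`, and the sample values are hypothetical names for illustration):

```ruby
# The code point range from the question, as plain integers.
URDU_RANGE = (0x0600..0x06ff)

# Hypothetical helper: walks a string character by character and counts
# how many characters fall inside and outside the 0600-06ff block.
def count_chars(text)
  in_block = 0
  other = 0
  text.each_char do |ch|
    if ch =~ /[\u0600-\u06ff]/
      in_block += 1
    else
      other += 1
    end
  end
  [in_block, other]
end

# Two ASCII letters, a space, and two characters from the block.
urdu_count, other_count = count_chars("ab \u0627\u0631")
puts urdu_count
puts other_count

# If the value is a hex string, the regex only sees the ASCII digits
# "0", "6", "2", "7" and never matches. Comparing the numeric value
# against the range is one way around that.
hex_string = "0627"
regex_hit = !(hex_string =~ /[\u0600-\u06ff]/).nil?
in_range = URDU_RANGE.include?(hex_string.hex)
puts regex_hit
puts in_range
```

If `unicode` really is the character itself, the regex branch is the relevant one, and the numeric comparison is not needed.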