RegexpError in Ruby when parsing \p{IsBasicLatin} character property

Question

I'm using JRuby 1.7.18 and have also tried JRuby 9000 (the latest version), where I get the same error. I'm using the soap-4r and nokogiri libraries to parse a WSDL XML file.

When the below part of the wsdl is parsed

<xs:pattern value="[\p{IsBasicLatin}]*"/>

I get the following error

RegexpError: (RegexpError) invalid character property name <IsBasicLatin>: /\A[\p{IsBasicLatin}]*\z/n
nokogiri/XmlSaxParserContext.java:252:in `parse_with'
nokogiri/XmlSaxParserContext.java:252:in `parse_with'
nokogiri/XmlSaxParserContext.java:252:in `parse_with'

In Ruby 1.9, which is one of the Ruby versions that JRuby 1.7.18 is compatible with, I read that character blocks like \p{IsBasicLatin} are not supported. But scripts like \p{Latin} are supported. I've tried changing IsBasicLatin to Latin and even tried a few other ones like InBasicLatin and InBasic_Latin but they all return the same error.

This is both in JRuby 1.7.18 and JRuby 9000 which is the latest version.

What is going wrong here and how can I fix it?

It's `In_Basic_Latin`. Maybe the regexp's encoding isn't Unicode (`u` modifier for UTF-8 but that should be the default) or JRuby 9000 doesn't support character properties (at least those) yet? — cremno, Aug 10 '15 at 18:59
@cremno I tried `In_Basic_Latin` and got the same error. I also suspected a Unicode problem, so I searched through the nokogiri source code, and its encoding is definitely being set to `UTF-8`. It's hardcoded in there. Unless it's somehow getting lost when it goes over to Java. If you look at my error log above, the source files are actually Java classes. — Graham, Aug 10 '15 at 19:04
Report it to the JRuby team. `IsBasicLatin` is wrong (in Ruby) but `In_Basic_Latin` should work. Editing the file to fix the error is okay, isn't it? — cremno, Aug 10 '15 at 19:18
@cremno yes that's what I mean when I said I tried `In_Basic_Latin`. I edited the wsdl file so it used that instead of `IsBasicLatin`. I guess I'll have to get in touch with the JRuby developers. — Graham, Aug 10 '15 at 20:05

score 0 · Accepted Answer · answered Aug 10 '15 at 20:48

As mentioned in the comments, the name of the character property is actually In_Basic_Latin, not IsBasicLatin. Modern versions of Ruby (MRI or CRuby, to be specific) use the regular expression library Onigmo. The official Ruby docs don't list all Unicode properties, but luckily Onigmo's documentation does.

JRuby apparently doesn't implement (at least) the Unicode block properties. However, information about blocks (name and range) is publicly accessible. \p{In_Basic_Latin} is therefore equivalent to [\u0000-\u007F]. So is [[:ascii:]].
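
A minimal sketch of that workaround, in case editing the WSDL by hand is not an option. The helper name `xsd_pattern_to_regexp` is hypothetical (it is not part of soap4r or nokogiri), and it assumes the block name only appears inside a bracket expression, as it does in the pattern from the question:

```ruby
# Hypothetical helper, not part of soap4r or nokogiri.
# It swaps the XSD block name for an explicit code point range and then
# anchors the pattern the same way the error message shows (\A ... \z).
# Assumption: \p{IsBasicLatin} only occurs inside [...] in the pattern.
def xsd_pattern_to_regexp(xsd_pattern)
  # Block form of gsub, so backslashes in the replacement are taken literally.
  body = xsd_pattern.gsub('\p{IsBasicLatin}') { '\u0000-\u007F' }
  Regexp.new("\\A#{body}\\z")
end

basic_latin = xsd_pattern_to_regexp('[\p{IsBasicLatin}]*')

puts !!(basic_latin =~ "plain ASCII")  # true
puts !!(basic_latin =~ "caf\u00E9")    # false, U+00E9 is outside Basic Latin

# The POSIX bracket form covers the same range.
ascii_only = /\A[[:ascii:]]*\z/
puts !!(ascii_only =~ "caf\u00E9")     # false
```

Spelling out the range avoids depending on which Unicode property names a given regex engine happens to recognize.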

Changing it from [\p{In_Basic_Latin}] to [\u0000-\u007F] worked, thank you! — Graham, Aug 10 '15 at 21:07
