
I'm writing a parser for RDF data in Turtle format in Swift. The Turtle grammar defines the pattern PN_CHARS_BASE as

[163s]  PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

(see the W3C Turtle recommendation).

The last group in the pattern, [#x10000-#xEFFFF], lies above U+FFFF, so its code points do not fit in a single UTF-16 code unit. They need a surrogate pair in UTF-16, or a full 32-bit value as in UTF-32.

This pattern is used to match, for instance, the first character of the prefix in a prefixed name, such as foaf in foaf:name. Digits are not allowed here.

I would like to use NSRegularExpression for parsing Turtle files. To match the PN_CHARS_BASE pattern, I have the following test code:

    let PN_CHARS_BASE = "[A-Z]|[a-z]|[\\u00C0-\\u00D6]|[\\u00D8-\\u00F6]|[\\u00F8-\\u02FF]|[\\u0370-\\u037D]|[\\u037F-\\u1FFF]|[\\u200C-\\u200D]|[\\u2070-\\u218F]|[\\u2C00-\\u2FEF]|[\\u3001-\\uD7FF]|[\\uF900-\\uFDCF]|[\\uFDF0-\\uFFFD]|[\\u10000-\\uEFFFF]"
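    do {
        let teststr = "9"
        let regex = try NSRegularExpression(pattern: PN_CHARS_BASE, options: [])
        // NSRange counts UTF-16 code units, so measure the string with utf16.count
        let range = NSRange(location: 0, length: teststr.utf16.count)
        let matches = regex.matches(in: teststr, options: [], range: range)
        print(matches.count)
    } catch {
        print("Invalid pattern: \(error)")
    }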

When I run this in the debugger, the regular expression returns one match for the test string 9. But digits are not allowed by this pattern, so the regex should return no matches. By removing parts of the pattern one at a time, I found that the last part, [\u10000-\uEFFFF], is what matches 9. This is the only part of the pattern that covers code points above U+FFFF, which include characters such as Egyptian hieroglyphs.

Is NSRegularExpression able to match code points above U+FFFF (UTF-32 characters)? If not, is there another way to match them?

Bhargav Rao
Dieudonné

1 Answer


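I just found the answer myself. Code points above U+FFFF have to be written with a different escape than the four-digit \u form.

Not [\u10000-\uEFFFF] but [\U00010000-\U000EFFFF] is needed to express the full range of Unicode characters. An escape with a capital \U takes exactly 8 hex digits and can name any code point, whereas \u takes exactly 4. That is also why the original pattern matched 9: [\u10000-\uEFFFF] is read as the character \u1000, the range from 0 to \uEFFF, and a literal F, and that range contains the digits.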
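The difference between the two escapes can be checked with a small test. This is only a sketch in current Swift syntax: `hasMatch` is a hypothetical helper written for this illustration, and the hieroglyph U+13000 is just one sample code point from the supplementary range.

```swift
import Foundation

// Hypothetical helper for this example: true if `pattern` matches anywhere in `text`.
func hasMatch(_ pattern: String, in text: String) -> Bool {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return false
    }
    // NSRange is measured in UTF-16 code units, not Swift Characters.
    let range = NSRange(location: 0, length: text.utf16.count)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

// \u consumes exactly four hex digits, so this set is parsed as
// U+1000, the range '0'...U+EFFF, and a literal 'F'.
let broken = "[\\u10000-\\uEFFFF]"

// \U consumes exactly eight hex digits and can name any code point.
let fixed = "[\\U00010000-\\U000EFFFF]"

// A sample character outside the Basic Multilingual Plane (an Egyptian hieroglyph).
let hieroglyph = "\u{13000}"

print(hasMatch(broken, in: "9"))        // true: '9' falls inside '0'...U+EFFF
print(hasMatch(fixed, in: "9"))         // false
print(hasMatch(fixed, in: hieroglyph))  // true
```

In the full PN_CHARS_BASE pattern only the last alternative needs to change; the four-digit \u escapes for the other groups are fine as they are.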

Dieudonné