I'm writing a parser for RDF data in Turtle format in Swift. The Turtle grammar defines the pattern PN_CHARS_BASE as
[163s] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
(see the W3C Turtle recommendation).
The last group in the pattern, [#x10000-#xEFFFF], lies outside the Basic Multilingual Plane: these code points do not fit in a single UTF-16 code unit and need a surrogate pair (or UTF-32).
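To make the size difference concrete, here is a minimal sketch in current Swift syntax (not the Swift 2 used elsewhere in this post). The sample character U+13000, an Egyptian hieroglyph, is just an illustrative pick from the last group.

```swift
import Foundation

// U+13000 (EGYPTIAN HIEROGLYPH A001) falls inside the [#x10000-#xEFFFF] group.
let hieroglyph = "\u{13000}"

// One Unicode scalar (code point)...
let scalarCount = hieroglyph.unicodeScalars.count
// ...but two UTF-16 code units (a surrogate pair).
let utf16Count = hieroglyph.utf16.count
// NSString lengths, and therefore NSRange values, count UTF-16 code units.
let nsLength = NSString(string: hieroglyph).length

print(scalarCount, utf16Count, nsLength)
```

This is why any NSRange passed to NSRegularExpression has to be measured in UTF-16 code units rather than in characters.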
This pattern is used to match, for instance, the first character of the prefix in a prefixed name such as foaf in foaf:name; digits are not allowed there.
I would like to use NSRegularExpression for parsing Turtle files. To match the PN_CHARS_BASE pattern, I have the following test code:
let PN_CHARS_BASE = "[A-Z]|[a-z]|[\\u00C0-\\u00D6]|[\\u00D8-\\u00F6]|[\\u00F8-\\u02FF]|[\\u0370-\\u037D]|[\\u037F-\\u1FFF]|[\\u200C-\\u200D]|[\\u2070-\\u218F]|[\\u2C00-\\u2FEF]|[\\u3001-\\uD7FF]|[\\uF900-\\uFDCF]|[\\uFDF0-\\uFFFD]|[\\u10000-\\uEFFFF]"
do {
    let teststr = "9"
    let regex = try NSRegularExpression(pattern: PN_CHARS_BASE, options: [])
    // NSRange counts UTF-16 code units, so use utf16.count rather than characters.count.
    let matches = regex.matchesInString(teststr, options: [], range: NSMakeRange(0, teststr.utf16.count))
    print(matches.count)
} catch {
    print("Invalid pattern: \(error)")
}
When I run this through the debugger, the regular expression returns one match on the test string 9. But digits are not allowed by this pattern, so the regex should return no matches. I removed parts of the pattern to determine which part matched 9 and found that the last part, [\u10000-\uEFFFF], matches 9. This is the only part of the pattern that lies outside the Basic Multilingual Plane, so it cannot be written with single UTF-16 code units; it includes characters such as Egyptian hieroglyphs.
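One possible explanation, assuming the ICU syntax that NSRegularExpression is documented to use: \u takes exactly four hex digits, while \U takes eight. Under that assumption, [\u10000-\uEFFFF] is read as U+1000, then the range from the digit 0 to U+EFFF, then two literal F characters, and that range contains 9. The sketch below, in current Swift syntax rather than the Swift 2 of the test code, compares the two spellings; matchCount is a hypothetical helper written for this example.

```swift
import Foundation

// Hypothetical helper: number of matches of `pattern` in `text`,
// or -1 if the pattern does not compile.
func matchCount(_ pattern: String, in text: String) -> Int {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return -1
    }
    // NSRange is measured in UTF-16 code units, hence utf16.count.
    let range = NSRange(location: 0, length: text.utf16.count)
    return regex.numberOfMatches(in: text, options: [], range: range)
}

// The four-digit spelling used in the test code.
let fourDigitForm = "[\\u10000-\\uEFFFF]"
// Assumed ICU spelling for code points above U+FFFF: \U plus eight hex digits.
let eightDigitForm = "[\\U00010000-\\U000EFFFF]"

print(matchCount(fourDigitForm, in: "9"))
print(matchCount(eightDigitForm, in: "9"))
print(matchCount(eightDigitForm, in: "\u{13000}"))
```

If the assumption holds, only the four-digit spelling should report a match on 9, and the eight-digit spelling should match the hieroglyph U+13000.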
Do you know if NSRegularExpression is able to support UTF-32 characters, that is, code points above U+FFFF? Or do you know of any other solution to support UTF-32 matching?
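As one regex-free alternative, the ranges of PN_CHARS_BASE could be checked directly against Unicode scalar values, which are full 32-bit code points. This is only a sketch in current Swift syntax; pnCharsBaseRanges, isPNCharsBase, and startsWithPNCharsBase are hypothetical names introduced for illustration.

```swift
// Code point ranges copied from the PN_CHARS_BASE production.
let pnCharsBaseRanges: [ClosedRange<UInt32>] = [
    0x41...0x5A, 0x61...0x7A, 0xC0...0xD6, 0xD8...0xF6, 0xF8...0x2FF,
    0x370...0x37D, 0x37F...0x1FFF, 0x200C...0x200D, 0x2070...0x218F,
    0x2C00...0x2FEF, 0x3001...0xD7FF, 0xF900...0xFDCF, 0xFDF0...0xFFFD,
    0x10000...0xEFFFF,
]

// Hypothetical helper: true if the scalar's code point is in one of the ranges.
func isPNCharsBase(_ scalar: Unicode.Scalar) -> Bool {
    return pnCharsBaseRanges.contains { $0.contains(scalar.value) }
}

// Hypothetical helper: checks only the first Unicode scalar of the text.
func startsWithPNCharsBase(_ text: String) -> Bool {
    guard let first = text.unicodeScalars.first else { return false }
    return isPNCharsBase(first)
}

print(startsWithPNCharsBase("foaf"))
print(startsWithPNCharsBase("9"))
print(startsWithPNCharsBase("\u{13000}"))
```

Working on unicodeScalars sidesteps surrogate pairs entirely, at the cost of writing the tokenizer by hand instead of composing regular expressions.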