
I am trying to run an XQuery that uses fn:matches with a regular expression, but MarkLogic's implementation of XQuery does not seem to allow hexadecimal character representations. The following gives me an "Invalid regular expression" error.

(: Find text containing non-ISO-Latin characters :)
let $regex := '[^\x00-\xFF]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

However, this one does not give the error.

let $regex := '[^a-zA-Z0-9]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

Is there a way to use the hexadecimal character representation, or an alternative that would give me the same result, in MarkLogic's implementation of XQuery?

Sofia
kalinma
  • Can you try the following code and let us know if it runs without error: `let $regex := '[^\x00\xFF]'` If it runs, the problem is with the range. If it doesn't run, then MarkLogic's regex engine apparently does not accept hexadecimal escapes. – Tim Biegeleisen May 01 '15 at 19:20
  • Thanks. It does indeed run: `let $regex := '[^\x00-\xFF]' return $regex` does not return an error – kalinma May 01 '15 at 19:26
  • The problem is the hex characters in a range, then. Every regex engine has different escaping rules inside a character set (e.g. some engines require `\[a-z\]`, while others might need `[\x{00}]`). It'll be hard to test without an actual MarkLogic console in front of me. – Tim Biegeleisen May 01 '15 at 19:29
  • Can you use the `[[:ascii:]]` class in MarkLogic regex? In your first example, you are essentially trying to match _any_ ASCII character. – Tim Biegeleisen May 01 '15 at 19:43

2 Answers


XQuery can use numeric character references in strings, in much the same way that XML and HTML can:

decimal: "&#10;" hex: "&#x0a;" (or just "&#xa;")

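However, not every code point can be written this way: "&#x00;", for instance, is not a legal character reference, and other control characters below "&#x09;" may also be rejected.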
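As a quick check of the notation, here is a minimal sketch that compares the decimal and hexadecimal forms of the same character. It uses only standard XQuery 1.0 functions. The variable names are illustrative.

```xquery
(: Decimal and hexadecimal references to the same code point (line feed) :)
let $decimal := "&#10;"
let $hex := "&#x0a;"
return (
    (: both literals are the same one-character string :)
    $decimal eq $hex,
    (: numeric code point of the character written with a hex reference :)
    fn:string-to-codepoints("&#xFF;")
)
```

This should return `true` followed by `255`, because the references are resolved when the string literal is parsed.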

There's no regex type in XQuery (you just use a string as a regex), so you can use character references in your regular expressions:

fn:matches("a", "[^&#x09;-&#xFF;]")

(: => xs:boolean("false") :)
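Applied to the query from the question, that might look like the sketch below. The namespace URI is a placeholder, since the question does not show how `myns` is bound. The lower bound of the range is `&#x09;` rather than `&#x00;`, following the example above.

```xquery
(: Hypothetical namespace URI; replace with the one your documents actually use :)
declare namespace myns = "http://example.com/myns";

(: Match any character outside the range U+0009 to U+00FF :)
let $regex := "[^&#x09;-&#xFF;]"
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>
```

Character references are resolved in both single-quoted and double-quoted string literals, so the quoting style of the original query can be kept.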

Update: here's the XQuery 1.0 spec on character references: http://www.w3.org/TR/xquery/#dt-character-reference.

Based on some brief testing, I think MarkLogic enforces XML 1.1 character reference rules: http://www.w3.org/TR/xml11/#charsets

For posterity, here are the XML 1.0 rules: http://www.w3.org/TR/REC-xml/#charsets

joemfb

Well, it seems MarkLogic's implementation of XQuery wants Unicode. As it turned out, even very small ranges in hex (e.g., [^\x00-\x0F]) threw the "Invalid regular expression" error, but Unicode notation did not. The following gives me results.

let $regex := '[^U0000-U00FF]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

I think the mere assignment let $regex := '[^\x00-\xFF]' did not throw the error because the value is treated as a plain string when I only return $regex; it is not compiled as a regular expression until it reaches fn:matches.

kalinma
  • That regex is not matching Unicode characters by hexadecimal code point; it's matching anything but `U00`, `0-U`, and `00FF` (i.e., those characters and that range are interpreted literally). – joemfb May 01 '15 at 20:28
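The point made in the comment above can be checked with a small sketch. It assumes the engine follows the XML Schema regular expression syntax, where `U`, `0`, and `F` inside a character class are ordinary characters and `0-U` is an ordinary range. The test strings are illustrative.

```xquery
let $regex := '[^U0000-U00FF]'
return (
    (: false: "U" is listed literally in the class :)
    fn:matches("U", $regex),
    (: true: lowercase ASCII letters fall outside the "0-U" range :)
    fn:matches("hello", $regex),
    (: true: e-acute (U+00E9) is outside the listed characters as well :)
    fn:matches("&#xE9;", $regex)
)
```

Since plain ASCII text such as "hello" already matches, the pattern returns results without actually isolating characters beyond U+00FF.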