
My goal is to analyze a postal code and identify its separate parts using a regular expression and the fn:analyze-string function.

I use MarkLogic 10. Using the regex for matching validates the example below correctly. However, when I use the same regex with fn:analyze-string, the groups are not identified correctly:

(: analyze a Dutch postal code :)
let $regex := "^[1-9]\d{3}([A-Z]{2}(\d+(\S+)?)?)?$"
return fn:analyze-string("1234AA11bis", $regex)

It returns the following:

<s:analyze-string-result xmlns:s="http://www.w3.org/2005/xpath-functions">
<s:match>1234<s:group nr="1">AA<s:group nr="2">1<s:group nr="3">1bis</s:group></s:group></s:group>
</s:match>
</s:analyze-string-result>

I expect it to return '11' as the value of group nr 2 and 'bis' as the value of group nr 3.

I tried some online regex analyzers, and they return the result I expect. Am I missing a flag, or is this a bug in MarkLogic?
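For comparison, here is a minimal sketch of what a backtracking engine with greedy quantifiers does with the same pattern, using Python's `re` module. The helper name `split_groups` is made up for this illustration, and Python's regex dialect is not identical to the XPath one, so treat this as an indication of what "other engines" return rather than as a statement about the spec.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"^[1-9]\d{3}([A-Z]{2}(\d+(\S+)?)?)?$")


def split_groups(code):
    """Return the three capture groups as a tuple, or None if there is no match.

    (Hypothetical helper, only for this illustration.)
    """
    m = PATTERN.match(code)
    return None if m is None else m.groups()


# Greedy \d+ takes both digits, so group 3 is left with 'bis'.
print(split_groups("1234AA11bis"))  # ('AA11bis', '11bis', 'bis')
```

Because the groups are nested, group 2 spans '11bis' as a whole. The '11' expected above corresponds to the part of group 2 that precedes group 3, which is what the text node inside `s:group nr="2"` holds in the analyze-string output.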

Marcel de Kleine

1 Answer


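I am not sure what the specs say about nested greedy patterns, but there is an easy fix: keep the last group from matching digits, so that only one split between the digits and the suffix is possible: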

let $regex := "^[1-9]\d{3}([A-Z]{2}(\d+([^\d\s]+)?)?)?$"
return fn:analyze-string("1234AA11bis", $regex)
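One caveat, sketched in Python rather than XQuery so it can be run anywhere: the adjusted pattern is slightly stricter than the original, because the last group can no longer contain digits. The names `ORIGINAL`, `FIXED` and `suffix` are made up for this illustration, and whether inputs such as `1234AA11b2` matter depends on your data.

```python
import re

# Pattern from the question and the adjusted pattern from this answer.
ORIGINAL = re.compile(r"^[1-9]\d{3}([A-Z]{2}(\d+(\S+)?)?)?$")
FIXED = re.compile(r"^[1-9]\d{3}([A-Z]{2}(\d+([^\d\s]+)?)?)?$")


def suffix(pattern, code):
    """Return group 3, or None if it is absent or the string does not match.

    (Hypothetical helper, only for this illustration.)
    """
    m = pattern.match(code)
    return m.group(3) if m else None


for code in ["1234AA11bis", "1234AA11b2"]:
    print(code, suffix(ORIGINAL, code), suffix(FIXED, code))
# 1234AA11bis bis bis
# 1234AA11b2 b2 None
```

With the adjusted pattern, group 3 cannot start with a digit, so `\d+` has to consume every digit and the split no longer depends on greediness. The price is that a string like `1234AA11b2`, which the original pattern accepts, is rejected.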

HTH!

grtjn
  • Thanks for the fix, that helps. The regex mentioned is an official one provided by the government. Still wondering why the output of ML differs from other engines. – Marcel de Kleine Sep 30 '20 at 14:11
  • Indeed, it's always best to make the regex unambiguous, rather than relying on greediness or non-greediness. – Michael Kay Sep 30 '20 at 14:11