Fixing regex to work around ICU/RegexKitLite bug

Question

I'm using RegexKitLite, which in turn uses ICU as its engine. Despite the documentation, a regex like /x*/ matches the empty string when searched against "xxxxxxxxxxx"; it behaves the way /x*?/ should. I would like to route around this bug when it's present, and I'm considering rewriting every unescaped * as + whenever a regex match returns a 0-length result. My naïve guess is that the regex with +s in place of *s will always return a subset of the correct results. What are the unexpected consequences of this? Am I going the right way?

FWIW, ICU also offers a *+ operator, but it doesn't work either.

EDIT: I should have been clearer: this is for the search field of an interactive app. I have no control over the regex that the user enters. The broken * support appears to be a bug in ICU. I sure wish I didn't need to include that POS in my code, but it's the only game in town.
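For reference, here is a minimal sketch of the behavior the question expects, using Python's `re` module as a stand-in engine. This is an assumption for illustration only: it is not ICU or RegexKitLite, and it says nothing about where the reported problem actually lies.

```python
import re

subject = "xxxxxxxxxxx"

# Greedy star: documented as "match as many times as possible".
greedy = re.match(r"x*", subject)

# Reluctant star: matches as few times as possible, here zero.
lazy = re.match(r"x*?", subject)

print(repr(greedy.group()))
print(repr(lazy.group()))
```

In this engine the greedy form consumes the whole run of x's and only the reluctant form returns the empty string, which is the distinction the question says is missing.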

What version of ICU/RegexKitLite are you using? What part of the documentation would lead you to expect a different result? — Steven R. Loomis, Feb 14 '11 at 17:55
I tried ICU 4.2 on Linux and whatever ships with MacOS (3.6, I think). I expect * to be greedy because the ICU docs for the * operator say: "Match 0 or more times. Match as many times as possible." See page 112 of this pdf: http://icu-project.org/userguide/icu.pdf — George, Feb 15 '11 at 06:38
That PDF is very much out of date.. I'll remove it. http://userguide.icu-project.org/ is the current user guide. — Steven R. Loomis, Feb 15 '11 at 16:16
Similar comment on http://userguide.icu-project.org/strings/regexp#TOC-Regular-Expression-Operators however. Please do file a ticket, if you haven't yet. — Steven R. Loomis, Feb 15 '11 at 18:01

score 1 · Accepted Answer · answered Feb 13 '11 at 00:31

If you simply change every * quantifier to a +, the regex will fail to work in those instances where the * should have matched zero occurrences. In other words, the problem will have morphed from always matching zero to never matching zero. If you ask me, it's useless either way.

However, you might be able to handle the zero-occurrences case separately, with a negative lookahead. For example, x* could be rewritten as (?:(?!x)|x+). It's hideous, I know, but it's the most self-contained fix I can envision at the moment. You would have to do this for possessive stars as well (*+), but not for reluctant stars (*?).

Here it is in table form:

BEFORE       AFTER
x*           (?:(?!x)|x+)
x*+          (?:(?!x)|x++)
x*?          x*?

More complex atoms would need to have their own parentheses preserved:

(?:xyz)*     (?:(?!(?:xyz))|(?:xyz)+)

You could probably drop them inside the lookahead, but they don't hurt anything except readability, and that's a lost cause anyway. :D If the {min,} and {min,max} forms are affected too, they would get the same treatment (with the same modifications for possessive variants):

x{0,}        same as x*
x{0,n}       (?:(?!x)|x{1,n})
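The rewrites in the tables above can be sanity-checked in an engine whose star is known to be greedy. The sketch below uses Python's `re` as that stand-in (an assumption; it is not ICU), and the helper names `spans` and `rewrites_agree` are made up for illustration. It uses n = 3 for the x{0,n} row and leaves out the possessive variants, since Python only accepts them from version 3.11.

```python
import re

def spans(pattern, text):
    """Return (start, end) for every match, scanning left to right."""
    return [m.span() for m in re.finditer(pattern, text)]

# (original, rewritten) pairs taken from the tables above.
PAIRS = [
    (r"x*", r"(?:(?!x)|x+)"),
    (r"(?:xyz)*", r"(?:(?!(?:xyz))|(?:xyz)+)"),
    (r"x{0,3}", r"(?:(?!x)|x{1,3})"),
]

SAMPLES = ["", "x", "xxxx", "axxb", "xyzxyzq", "xyxyz"]

def rewrites_agree():
    """True if every rewritten pattern matches the same spans as its original."""
    return all(spans(a, s) == spans(b, s) for a, b in PAIRS for s in SAMPLES)

print(rewrites_agree())
```

Agreement here only shows that the rewrite preserves meaning in a correctly behaving engine; whether it sidesteps the misbehavior in a given ICU setup has to be tested there.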

It occurs to me that conditionals--(?(condition)yes-pattern|no-pattern)--would be a perfect fit here; unfortunately, ICU doesn't seem to support them.

Sadly, this doesn't work either. I'm following up with ICU to file a bug report. If I have to stick with rewriting the regex, is there a case where rewriting * to + would cause the program to miss a non-zero-length match that it would otherwise have found? I rewrite only in the case that the entire match is zero-length. — George, Feb 13 '11 at 17:26
It's the interaction that kills you: `/x*x*/` should match `"x"`, but `/x+x+/` won't; `/.*+.*/s` should match any string, but `/.++.+/` will never match. For a not-completely-silly example, `/\w*?(\d*)/` should match a word, capturing any and all trailing digits in group #1. Change it to `/\w+?(\d+)/` and (as before) it doesn't match single-character words at all. Additionally, while the word `"123"` *will* be matched, only the `"23"` will be captured in group #1; the original regex **should** have captured `"123"` in group #1. — Alan Moore, Feb 13 '11 at 18:38
Please do file a bug report. Also, file a request for conditionals. Even better, ICU is open source- one could contribute support for conditionals. — Steven R. Loomis, Feb 14 '11 at 17:56
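The interaction cases from the comment above can be reproduced in any engine with a working star. This sketch uses Python's `re` (an assumption, standing in for a correct engine) and skips the possessive example, which needs Python 3.11 or later.

```python
import re

# x*x* can match a single "x"; x+x+ needs at least two.
star_single = re.fullmatch(r"x*x*", "x")
plus_single = re.fullmatch(r"x+x+", "x")

# \w*?(\d*) captures all the digits of "123" in group 1 ...
star_digits = re.match(r"\w*?(\d*)", "123")
# ... while \w+?(\d+) forces \w+? to consume the first digit.
plus_digits = re.match(r"\w+?(\d+)", "123")

# A single-character word no longer matches at all after the rewrite.
plus_short = re.match(r"\w+?(\d+)", "7")

print(star_single is not None, plus_single is not None)
print(star_digits.group(1), plus_digits.group(1))
print(plus_short)
```

So the blanket * to + rewrite can both lose matches and silently change what a capture group holds.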

score 1 · Answer 2 · Andy Heninger · answered Feb 24 '11 at 00:46 · edited Feb 24 '11 at 00:54

I can't say where things may have gone wrong with the code in question, but I can say with confidence that this specific bug is not in the ICU library. (I'm the author of the ICU regular expression package.)

I agree with the sentiment expressed above: the thing to do is not to hack around the problem by tweaking the regexp pattern, but to understand what the underlying problem is. There's probably some simple mistake being made that isn't clear from the original question as posed.


I realize it's old, but it's a chance to write to the author of the software. I believe that this bug *is* in the ICU library, *if* I face the same problem. The REGEX function in LibreOffice is implemented using ICU [ https://opengrok.libreoffice.org/xref/core/sc/source/core/tool/interpr1.cxx?r=61f4250e&mo=310559&fi=9369#9369 ]; and it has a bug https://bugs.documentfoundation.org/show_bug.cgi?id=147875 where regex "[^;]*" against "1;2;3" will match three empty strings in addition to three numbers. – Mike Kaganski Mar 09 '22 at 16:14
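As a point of comparison for the `[^;]*` case in the comment above, here is what another engine does with the same global search. This is a sketch assuming Python 3.7 or later; it does not settle what ICU or LibreOffice ought to do.

```python
import re

# Find every match of [^;]* in "1;2;3", scanning left to right.
matches = re.findall(r"[^;]*", "1;2;3")

non_empty = [m for m in matches if m]
empty_count = matches.count("")

print(matches)
print(non_empty, empty_count)
```

Python also reports an empty match after each number, because `[^;]*` can legitimately match zero characters at the position of each `;` and at the end of the string. That is a different situation from the question's report, where `x*` returns an empty match at a position where x's are available.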

score 0 · Answer 3 · answered Feb 12 '11 at 22:32

Both \* and [*] are literal asterisks, so a naive replacement mightn't work.

In fact, don't do dynamic rewriting; it's too complicated. Try to tweak your regexes statically first.

x* is equivalent to x{0,} and (?:x+)?.
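To illustrate why a naive search-and-replace on `*` is not enough, here is a rough sketch of a scanner that finds only the stars acting as quantifiers. The function name `unescaped_star_positions` is hypothetical. It assumes that bracket expressions may nest (as in ICU set syntax such as `[[a-z]&&[^aeiou]]`), and it deliberately ignores `\Q...\E` quoting and other flavor-specific details.

```python
def unescaped_star_positions(pattern):
    """Return indexes of '*' characters that are not escaped and not
    inside a [...] character class (simplified; see assumptions above)."""
    positions = []
    depth = 0  # nesting depth of [...] classes
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif ch == "*" and depth == 0:
            positions.append(i)
        i += 1
    return positions

print(unescaped_star_positions(r"a\*b*[*]c*"))
```

The returned positions still include the star in `*?` and `*+`, so a caller would need to look at the following character before deciding how to rewrite.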

aaz

`(?:x+)?` is **not** equivalent to `x*`. It may match the same strings, but in cases where it *can't* match, it will cause a severe performance hit. – Alan Moore Feb 13 '11 at 00:37
@Alan - That's an implementation detail: a good compiler can compile both to the same thing. In this case, `x*` is incorrect, so it might as well be infinitely fast. – aaz Feb 13 '11 at 01:01
What you say may be true of text-directed regex flavors like GNU's `grep` and `awk`, but regex-directed flavors like those found in ICU and most popular programming languages do not guarantee that `(?:x+)?` and `x*` will show the same performance characteristics. But even if ICU did make that guarantee, would you really be willing to trust it? ;) – Alan Moore Feb 13 '11 at 01:40

score 0 · Answer 4 · answered Feb 12 '11 at 23:48

Yeah, use that strategy:
(pseudo code)

if ($str =~ /x*/ && $str =~ /(x+)/) { print "'$1'\n"; }

But the real problem is the BUG as you say. Why on earth is the basic construct of quantifiers screwed up? This is not a module you should include in your code.
