Check if a regex is ambiguous

Question

I wonder if there is a way to check the ambiguity of a regular expression automatically. A regex is considered ambiguous if there is a string that can be matched in more than one way by the regex. For example, given the regex R = (ab)*(a|b)*, we can detect that R is ambiguous since there are two ways to match the string ab against R.

UPDATE

The question is about how to check whether the regex is ambiguous by definition. I know that in practical regex implementations a string is always matched in exactly one way, but please read and think about this question in an academic way.

Umh, it's not clear how I will use it, but I believe that there will be several applications. I'm still working on that, so I will come up with a more detailed answer later. — Loi.Luu, Dec 16 '13 at 06:45
Does the automatic test need to come up with the target string `ab` by itself, or will that be handed in? — woolstar, Dec 16 '13 at 06:46
The input will be a regex, `ab` in my question is just an example. — Loi.Luu, Dec 16 '13 at 06:50
Interesting question. I think you should expand the question a little. First, you should explain what you mean by "ambiguous" because that can be interpreted in many ways. Next, are you talking about academic regular languages, or modern regex implementations? Those are not the same. — Kobi, Dec 16 '13 at 07:13
How could the second capture group ever match the string "ab" when the first capture group will have already matched it? Every regex implementation I'm aware of works from left to right, and there are *not* two ways to match string "ab" from R in this case. I don't get how this is ambiguous. What regex implementation are you working with? — Dagg Nabbit, Dec 16 '13 at 07:30
@DaggNabbit This is about academic regexes, and I think we are looking at the question from different perspectives. The main point of the question is to check whether the regex is ambiguous or not, by definition. I know that in modern regex implementations, if a string matches a regex, there is always exactly one way to match it. — Loi.Luu, Dec 16 '13 at 07:32
Vaguely thinking diagonal proof / undecidable in finite time / halting problem. — tripleee, Dec 16 '13 at 07:32
I don't have the skills to work on anything like a proof, but for any algorithm that comes up with possibly ambiguous strings to test with, how can you know that the next, slightly longer string it comes up with is not going to be ambiguous? — tripleee, Dec 16 '13 at 07:52
@tripleee That was my first instinct as well. But regular expressions are far from Turing complete. — Taemyr, Dec 16 '13 at 09:36
I know my comment will not be very helpful, but nobody has mentioned it yet. Any serious response would need automata theory or some rigorous mathematical proof. Maybe on another site, even within Stack Exchange, you would get better answers. Your question is more about maths or computer science than about programming. — durum, Dec 16 '13 at 10:10

score 5 · Answer 1 · answered May 18 '15 at 08:17

A regular expression is one-ambiguous if and only if the corresponding Glushkov automaton is not deterministic. This property can now be checked in linear time. Here's a link. BTW, deterministic regular expressions have also been investigated under the name of one-unambiguity.
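To make the determinism check concrete, here is a minimal Python sketch. It is not the linear-time algorithm mentioned above, just a direct construction of the first and follow sets of the Glushkov (position) automaton. The tuple-based AST (`("lit", c)`, `("cat", r1, r2)`, `("alt", r1, r2)`, `("star", r)`) and the function names are assumptions made for illustration. Note that this tests determinism, which is stricter than the question's notion of ambiguity: (a|b)*a gives every string a single parse, yet its Glushkov automaton is not deterministic.

```python
def glushkov(r):
    """Return (first, last, nullable, follow, symbol) for a tuple-based AST.

    Hypothetical AST: ("lit", c), ("cat", r1, r2), ("alt", r1, r2), ("star", r).
    Each literal occurrence gets its own position number.
    """
    symbol = {}  # position -> letter
    follow = {}  # position -> set of positions that may come next

    def walk(node):
        kind = node[0]
        if kind == "lit":
            p = len(symbol)
            symbol[p] = node[1]
            follow[p] = set()
            return {p}, {p}, False
        if kind == "alt":
            f1, l1, n1 = walk(node[1])
            f2, l2, n2 = walk(node[2])
            return f1 | f2, l1 | l2, n1 or n2
        if kind == "cat":
            f1, l1, n1 = walk(node[1])
            f2, l2, n2 = walk(node[2])
            for p in l1:  # the end of the left part continues into the right part
                follow[p] |= f2
            first = (f1 | f2) if n1 else f1
            last = (l1 | l2) if n2 else l2
            return first, last, n1 and n2
        if kind == "star":
            f, l, _ = walk(node[1])
            for p in l:  # the end of one iteration continues into the next
                follow[p] |= f
            return f, l, True
        raise ValueError(f"unknown node kind: {kind}")

    first, last, nullable = walk(r)
    return first, last, nullable, follow, symbol


def is_one_unambiguous(r):
    """True if no first/follow set holds two positions with the same letter."""
    first, _, _, follow, symbol = glushkov(r)
    for successors in [first, *follow.values()]:
        letters = [symbol[p] for p in successors]
        if len(letters) != len(set(letters)):
            return False
    return True


A, B = ("lit", "a"), ("lit", "b")
# R = (ab)*(a|b)* from the question
R = ("cat", ("star", ("cat", A, B)), ("star", ("alt", A, B)))
print(is_one_unambiguous(R))
```

For R the first set contains two positions labelled a (the a of `(ab)` and the a of `(a|b)`), so the check reports that R is not deterministic.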

score 3 · Answer 2 · woolstar

You are forgetting greed. Usually one section gets first dibs because it is a greedy match, and so there is no ambiguity.

If instead you are talking about a hypothetical pattern-matching engine without practical details like greed, then the answer is yes, you can.

Take every element of the pattern. And try every possible subset against every possible string. If more than one subset matches the same string, then there's an ambiguity. Optimizing this to take less than infinite time is left as an exercise for the reader.
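A bounded version of this brute-force idea can be sketched in Python. The code below counts the number of distinct ways a regex can derive a given string and then tries all strings up to a chosen length, so it can only find a witness of ambiguity, never prove its absence. The tuple-based AST and the names `count_parses` and `first_ambiguous` are assumptions made for illustration; the sketch also assumes that each iteration of a star consumes at least one character, which avoids counting infinitely many parses when a starred subexpression can match the empty string.

```python
from functools import lru_cache
from itertools import product

# Hypothetical AST: ("lit", c), ("cat", r1, r2), ("alt", r1, r2), ("star", r)


@lru_cache(maxsize=None)
def count_parses(r, s):
    """Number of distinct ways regex r can match the whole string s."""
    kind = r[0]
    if kind == "lit":
        return 1 if s == r[1] else 0
    if kind == "alt":
        return count_parses(r[1], s) + count_parses(r[2], s)
    if kind == "cat":
        # Try every split point between the two parts.
        return sum(
            count_parses(r[1], s[:i]) * count_parses(r[2], s[i:])
            for i in range(len(s) + 1)
        )
    if kind == "star":
        if s == "":
            return 1
        # Assumption: every iteration consumes at least one character.
        return sum(
            count_parses(r[1], s[:i]) * count_parses(r, s[i:])
            for i in range(1, len(s) + 1)
        )
    raise ValueError(f"unknown node kind: {kind}")


def first_ambiguous(r, alphabet, max_len):
    """Shortest string (up to max_len) with more than one parse, or None."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if count_parses(r, s) > 1:
                return s
    return None


A, B = ("lit", "a"), ("lit", "b")
# R = (ab)*(a|b)* from the question
R = ("cat", ("star", ("cat", A, B)), ("star", ("alt", A, B)))
print(count_parses(R, "ab"))        # 2
print(first_ambiguous(R, "ab", 3))  # ab
```

The cost grows exponentially with `max_len`, which is the termination problem raised in the comments in a bounded form.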

edited Dec 16 '13 at 16:37

answered Dec 16 '13 at 06:24


This is more a comment than an answer ;) – brandonscript Dec 16 '13 at 06:25
r3mus. No, it's not! Given that both alternatives are equally greedy, "ab" will always match with the first expression. Therefore the result is predictable in every case and there is no ambiguity. – James Anderson Dec 16 '13 at 06:30
Greediness is a property of particular regex engines, not regexes in the abstract. The question doesn't specify how it's interpreted; there aren't any language tags, for example. – chrylis -cautiouslyoptimistic- Dec 16 '13 at 06:37
**Question**: "I wonder if there is a way to check the ambiguity of a regular expression automatically". This doesn't answer that, this just points out a characteristic that would need to be considered in such an endeavor. – brandonscript Dec 16 '13 at 06:40
Well, I think **greed** depends largely on the Regex mechanism. The thing here is if we have the DFA representation of `R` then there are two paths to match `ab`, right? – Loi.Luu Dec 16 '13 at 06:41
What about when you cannot match? The engine still has to try all possibilities: [Slow Regex performance](http://stackoverflow.com/q/9687596/7586). Either way, the pattern in the question is just an example. If you're saying ambiguity is never a problem due to "greed", you are simply wrong. I usually try, when it is possible, to let my patterns only match in one way, and fail quickly when they can't. – Kobi Dec 16 '13 at 07:03
How about the pattern `(ab)*(a|b){100}`? What does "greed" do for you there? – Kobi Dec 16 '13 at 07:11
@Kobi I'm sure you know the answer; backtracks to the point where success can be obtained, which would be when the right-hand side contains 100 matches, and the left hand, zero. But I agree in principle that the general answer would have to ignore greed, or rather, merely include it as one of several possible behaviors. – tripleee Dec 16 '13 at 07:29
"And try every possible subset against every possible string." This will not terminate. – Taemyr Dec 16 '13 at 09:02

score 2 · Answer 3 · edited Nov 01 '14 at 08:03

I read a paper published around 1980 which showed that whether a regular expression is ambiguous can be determined in O(n^4) time. I wish I could give you a reference, but I no longer remember it or even the journal.

A more expensive way to determine whether a regular expression is ambiguous is to construct a finite state machine from it using the subset construction (exponential in time and space in the worst case). Consider any state X of the FSM, which is built from a set N of NFA states. If, for any two distinct NFA states n1 and n2 in X, the intersection of follow(n1) and follow(n2) is not empty, then the regular expression is ambiguous. If this holds for no state of the FSM, then the regular expression is not ambiguous.
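For comparison, here is a minimal Python sketch of a polynomial-time check on an NFA. It is neither the algorithm from the paper mentioned above (whose reference is unknown) nor the follow-set test just described. It uses a different, standard idea: pair the NFA with itself and look for a pair of distinct states that is reachable from the start pair and can still reach a pair of accepting states. The sketch assumes an epsilon-free NFA with a single start state whose paths mirror the parses of the regex (for instance a position automaton); the example table is the three-state NFA for (ab)*(a|b)* given in another answer to this question, and all function names are hypothetical.

```python
from collections import deque


def reachable(starts, successors):
    """All nodes reachable from `starts` via successors(node)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in successors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_ambiguous_nfa(delta, start, accepting):
    """delta maps state -> letter -> set of states (no epsilon moves).

    Returns True if some word has two different accepting paths.
    """
    def step(pair):
        s, t = pair
        for letter, targets in delta[s].items():
            for s2 in targets:
                for t2 in delta[t].get(letter, ()):
                    yield (s2, t2)

    forward = reachable([(start, start)], step)

    # Reverse edges of the product, restricted to forward-reachable pairs.
    reverse = {}
    for pair in forward:
        for nxt in step(pair):
            reverse.setdefault(nxt, set()).add(pair)

    finals = [(s, t) for (s, t) in forward if s in accepting and t in accepting]
    useful = reachable(finals, lambda pair: reverse.get(pair, ()))

    # Two runs on the same word that sit in different states at some point.
    return any(s != t for (s, t) in useful)


# Three-state NFA for (ab)*(a|b)*: p is the start, p and r accept.
DELTA = {
    "p": {"a": {"q", "r"}, "b": {"r"}},
    "q": {"b": {"p"}},
    "r": {"a": {"r"}, "b": {"r"}},
}
print(is_ambiguous_nfa(DELTA, "p", {"p", "r"}))  # True
```

The product has at most n^2 pairs for an NFA with n states, so the search stays polynomial, unlike the subset construction.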

score 1 · Answer 4 · answered Dec 16 '13 at 09:32

A possible solution:

Construct an NFA for the regexp. Then analyse the NFA, starting with a set of states consisting solely of the initial state. Do a depth-first or breadth-first traversal in which you keep track of whether you can be in multiple states at once. You also need to track the path taken in order to eliminate cycles.

For example your (ab)*(a|b)* can be modeled with three states.

 |   a   |   b
p| {q,r} |  {r}
q|  {}   |  {p}
r|  {r}  |  {r}

Here p is the starting state, and p and r are accepting.

You then need to consider both letters and proceed with the sets {q,r} (after a) and {r} (after b). The set {r} only leads to {r}, giving a cycle, so we can close that path. From {q,r}, an a takes us to {r}, which is an accepting state; but since this path cannot accept if we started by going to q, we only have a single path here, and we can close it once we identify the cycle. A b from {q,r} takes us to {p,r}. Since both of these states accept, we have identified an ambiguous position and can conclude that the regexp is ambiguous.
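The walk-through above can be sketched in Python as a breadth-first search over reachable state sets. The transition table is copied from this answer; the function name and the stopping rule (stop at the first set containing two accepting states) are assumptions made for illustration. Finding such a set is a sufficient sign of ambiguity, since two different paths accept the same word, but this sketch does not catch two paths that end in the same accepting state.

```python
from collections import deque

# Transition table from the answer: state -> letter -> set of next states.
DELTA = {
    "p": {"a": {"q", "r"}, "b": {"r"}},
    "q": {"a": set(), "b": {"p"}},
    "r": {"a": {"r"}, "b": {"r"}},
}
START = "p"
ACCEPTING = {"p", "r"}


def find_double_accept(delta, start, accepting):
    """Search reachable state sets breadth-first.

    Returns (word, states) for the first set holding two or more accepting
    states, or None if no such set is reachable.
    """
    first = frozenset([start])
    seen = {first}  # already visited sets; revisiting one would be a cycle
    queue = deque([(first, "")])
    while queue:
        states, word = queue.popleft()
        if len(states & accepting) >= 2:
            return word, states
        letters = sorted({letter for s in states for letter in delta[s]})
        for letter in letters:
            nxt = frozenset(t for s in states for t in delta[s].get(letter, ()))
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, word + letter))
    return None


word, states = find_double_accept(DELTA, START, ACCEPTING)
print(word, sorted(states))  # ab ['p', 'r']
```

The search visits {p}, then {q,r} and {r}, and finally {p,r} after reading ab, matching the steps described above.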

Why do we have {q, r} in the (p, a) cell of the table? I thought it should be {q} only, right? And why don't we have the $\epsilon$ character? — Loi.Luu, Dec 16 '13 at 12:31
In the p state you are either at the beginning of the match or you have just finished matching an instance of `(ab)`. In this state, when you see an a, you can either start a new instance of `(ab)`, which takes you to state q, or start an instance of `(a|b)`, which takes you to state r. POSIX standard regexps are not anchored, so the match need not end at the end of the string, which is why I do not include epsilon. — Taemyr, Dec 16 '13 at 12:38
Thanks, I get your point. But I still can't generalize it to solve the general problem... — Loi.Luu, Dec 16 '13 at 12:57
