
I am having trouble compiling this regular expression with flex:

"on"[ \t\r]*[.\n]{0,300}"."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

I had 100 rules in the rules section of my flex specification file. I tried to compile it with flex -Ce -Ca rule.flex. After 10 hours it still had not finished, so I killed it. I then narrowed the problem down to this rule: if I remove it from the 100 rules, flex takes 21 seconds to generate the C code.

If I replace the period with some other character, it compiles successfully. For example,

"on"[ \t\r]*[.\n]{0,300}"A"[ \t\r]*[.\n]{0,300}"from"    {counter++;} 

compiles in no time. Even a period preceded or followed by a space character compiles quickly:

"on"[ \t\r]*[.\n]{0,300}" ."[ \t\r]*[.\n]{0,300}"from"    {counter++;}

I can see from the flex manual that "." matches a literal ".".

What is wrong with my rule?

rici
Aryaveer
  • Don't use lexical analysers for parsing. Use them to separate the tokens from each other, and write yourself a separate parser, either by hand or with a parser generator such as yacc/bison. You'll never get there from here. – user207421 Mar 08 '16 at 11:12
  • @EJP I am not using flex for parsing. I have thousands of regular expressions, each of them representing a template of text message. I was using Java's regex to match templates. But that is unacceptably slow. So I compiled all regex into a DFA using flex. It is ultra fast. All I want to know is whether given text message matches any of the regex. – Aryaveer Mar 08 '16 at 11:20
  • You *are* using *flex* for parsing. Period. – user207421 Mar 08 '16 at 11:54
  • @EJP regardless, I think that is a legitimate rule. If you think otherwise I'll be happy to correct myself :) – Aryaveer Mar 08 '16 at 12:06

1 Answer


The simple answer is that [.\n] probably doesn't do what you think it does. Inside a character class, most metacharacters lose their special meaning, so that character class contains only two characters: a literal . and a newline. You should use (.|\n).

But that won't solve the problem.

The underlying cause is the use of a fixed repetition count. Large (or even not so large) repetition counts can result in exponential blow-up of the state machine, if the end of the matched region is ambiguous.

With the repetition of [.\n], the repeated match has an unambiguous termination unless the rest of the regex can start with a dot or a newline. So "." triggers the problem, but "A" doesn't. If you correct the repetition to match any character, then any following character will trigger exponential blow-up. So if you make the change suggested above, the regular expression will continue to be uncompilable.
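The effect can be reproduced outside flex with a small subset construction. The sketch below is only a model, not flex's algorithm: it assumes a simplified pattern `[.\n]{0,n} SEP [.\n]{0,n} "f"` (no "on" prefix, no whitespace runs, and "from" shortened to "f"), and the names `step` and `dfa_state_count` are invented for the illustration. It counts the reachable DFA states, including the dead state, for a separator that the repeated class can match (".") and one that it cannot ("A").

```python
from collections import deque

REPEATED = {".", "\n"}            # the character class [.\n]
ALPHABET = [".", "\n", "A", "f"]  # enough symbols to exercise every transition


def step(state, ch, n, sep):
    # NFA transitions for the simplified pattern
    #   [.\n]{0,n} sep [.\n]{0,n} "f"
    # A state is (phase, count): phase "A" is the first repetition,
    # phase "B" the second, and "ACC" the accepting state.
    phase, count = state
    out = set()
    if phase == "ACC":
        return out
    if ch in REPEATED and count < n:
        out.add((phase, count + 1))   # consume one more repeated character
    if phase == "A" and ch == sep:
        out.add(("B", 0))             # consume the separator
    if phase == "B" and ch == "f":
        out.add(("ACC", 0))           # consume the final literal
    return out


def dfa_state_count(n, sep):
    # Classic subset construction: every reachable set of NFA states
    # (including the empty set, i.e. the dead state) is one DFA state.
    start = frozenset({("A", 0)})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for ch in ALPHABET:
            nxt = frozenset(t for s in current for t in step(s, ch, n, sep))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, dfa_state_count(n, "."), dfa_state_count(n, "A"))
```

With "A" as the separator, the model needs exactly 2n + 4 states, because only one NFA state is ever active at a time. With "." as the separator, every distinct arrangement of dots and newlines in the first n characters leaves a different set of active counters, so the count is at least 2^(n+1) - 1. The absolute numbers will not match flex's, since the model leaves out parts of the real rule.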

Changing the repetition count to an indefinite repetition (the star operator) would avoid the problem.


To illustrate the problem, I used the -v option to check the number of states with different repetition counts. This clearly shows the exponential increase in state count, and it's obvious that going much further than 14 repetitions would be impossible. (I didn't show the time consumption. Suffice it to say that flex's algorithms are not linear in the size of the DFA, so while each additional repetition doubles the number of states, it roughly quadruples the time consumption. At 16 repetitions, flex took 45 seconds, so it's reasonable to assume that it would take about a week to do 23 repetitions, provided that the 6GB of RAM it would need was available without too much swapping. I didn't try the experiment.)

$ cat badre.l
%%
"on"[ \t\r]*[.\n]{0,XXX}"."[ \t\r]*[.\n]{0,XXX}"from"
$ for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14; do
>   printf '{0,%d}:\t%24s\n' $i \
>      "$(flex -v -o /dev/null <( sed "s/XXX/$i/g" badre.l) |&
>         grep -o '.*DFA states')"
> done
{0,1}:        17/1000 DFA states
{0,2}:        25/1000 DFA states
{0,3}:        41/1000 DFA states
{0,4}:        73/1000 DFA states
{0,5}:       137/1000 DFA states
{0,6}:       265/1000 DFA states
{0,7}:       521/1000 DFA states
{0,8}:      1033/2000 DFA states
{0,9}:      2057/3000 DFA states
{0,10}:     4105/6000 DFA states
{0,11}:    8201/11000 DFA states
{0,12}:   16393/21000 DFA states
{0,13}:   32777/41000 DFA states
{0,14}:   65545/82000 DFA states

Changing the regex to use (.|\n) for both repetitions roughly triples the number of states, because with that change both repetitions become ambiguous (and there is an interaction between the two of them).

rici
  • I understand what you said, but it is compiling with [.\n]. The problem is ".": it compiles when I replace this character with some other character, or when I put a space before the period, i.e. " .". My question is: why is this happening? – Aryaveer Mar 08 '16 at 14:30
  • @aryaveer: I explain that in the next paragraph. If the repeated pattern can match the first character of what follows, you get exponential blowup with fixed repetition counts. So the `"."` is a problem because `[.\n]` matches it. It doesn't match `A` or space, so no problem in those cases. – rici Mar 08 '16 at 14:46
  • @aryaveer: Now with an illustration of the exponential state blow-up, showing the number of states needed for various (small) repetition counts. – rici Mar 08 '16 at 21:25
  • You are right. Is there any way to match a pattern of this type "on"[ \t\r]*(.){0,300}"A"[ \t\r]*(.){0,300}"from" {counter++;} Can we use a parser or some other tool? – Aryaveer Mar 09 '16 at 06:43
  • @aryaveer, anything is parseable if you can describe what it is you want to parse. But "a pattern of this type..." doesn't tell me anything. What exactly do you want to match? I seriously doubt that what you are looking for is "the longest string consisting of the word 'on' followed by at most 600 almost arbitrary characters (possibly containing the word 'from') and then the word 'from', where there is an 'A' no more than 300 characters from either end of the separating characters." If that *is* what you want, then you could avoid the fixed repetition by cutting the string first... – rici Mar 09 '16 at 20:03
  • ... at 610 characters. That would let you use indefinite repetition. But, as I said, I don't believe that is really the pattern you want to match, or at least it doesn't immediately strike me as a useful pattern to match. More likely is that you want to match the *first* 'from', and not look more than 600 characters to find it. That's also easily parseable. In neither case is flex really the tool of choice, though. – rici Mar 09 '16 at 20:05
  • Here is one concrete example: "Dear"[ \t\r]*"Customer,"[ \t\r]*"Your"[ \t\r]*"package"[ \t\r]*(.){0,80}[ \t\r]*"is"[ \t\r]*"out"[ \t\r]*"for"[ \t\r]*"delivery"[ \t\r]*"via"(.){0,80}[ \t\r]*"Courier,"[ \t\r]*(.){0,80}[ \t\r]*"on"(.){0,80}"."[ \t\r]*"Delivery"[ \t\r]*"will"[ \t\r]*"be"[ \t\r]*"attempted"[ \t\r]*"in"[ \t\r]*"5"[ \t\r]*"wkg"[ \t\r]*"days." I asked this question here: http://stackoverflow.com/questions/35900614/efficient-matching-of-text-messages-against-thousands-of-regular-expressions – Aryaveer Mar 09 '16 at 20:20
  • @Aryaveer: Do you really mean `[ \t\r]`? I would have thought `[ \t\n\r]` or maybe `[[:space:]]`... Anyway, that makes a bit more sense, but why are you limiting the gaps to 80 characters? Is the string you are matching against more than one message? (In that case, I'd suggest breaking the input into separate messages and matching each one separately.) Perhaps you want non-greedy matching. Flex doesn't implement that, but you could use Russ Cox's `re2` regex library which is guaranteed not to backtrack unless you actually use back-references. – rici Mar 09 '16 at 20:25
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/105843/discussion-between-aryaveer-and-rici). – Aryaveer Mar 09 '16 at 20:27