
I want to match these two tokens:
1. NUM: A series of characters in [0-9_] with an optional . in between.
2. ID: A series of characters in [a-zA-Z0-9_] with at least one [a-zA-Z] character.

Flex rules for these will be:

[0-9_]+([.][0-9_]+)?/[^a-zA-Z0-9_] return NUM;

[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]* return ID;

.
.
.

Note that trailing context is required for NUM, since "123.456ab" should be tokenized as 123 (NUM), . (OPER) and 456ab (ID). Without the trailing context, it would be tokenized as 123.456 (NUM) and ab (ID).

The problem now is that this rule will not match a NUM that is immediately followed by EOF. So, how can I match EOF in the trailing context of a flex rule?

TL;DR:
What I want: NUM not followed by [a-zA-Z0-9_].
What I'm getting currently: NUM followed by a character other than [a-zA-Z0-9_].
These two differ only at EOF.
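
The difference can be reproduced outside flex. The sketch below uses Python's `re` module, not flex: a positive lookahead stands in for the trailing context, and a negative lookahead (which flex does not offer) expresses what is actually wanted. The names `followed_by_other`, `not_followed_by_word` and `match_len` are made up for this illustration.

```python
import re

NUM = r"[0-9_]+(?:[.][0-9_]+)?"

# Emulates the flex rule: NUM followed by a character other than [a-zA-Z0-9_].
followed_by_other = re.compile(NUM + r"(?=[^a-zA-Z0-9_])")

# The intended meaning: NUM not followed by [a-zA-Z0-9_].
# A negative lookahead also succeeds when there is no next character at all.
not_followed_by_word = re.compile(NUM + r"(?![a-zA-Z0-9_])")


def match_len(pattern, text):
    """Length of the match at the start of text, or None if there is none."""
    m = pattern.match(text)
    return len(m.group(0)) if m else None


if __name__ == "__main__":
    for text in ["123 ", "123", "123ab"]:
        print(repr(text),
              match_len(followed_by_other, text),
              match_len(not_followed_by_word, text))
```

For "123 " both patterns match three characters and for "123ab" neither matches; only for "123" at end of input do they disagree, which is the EOF case described above.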

EDIT: I just learned that Re/Flex supports word boundaries. If I switch from Flex to Re/Flex, are there any performance downsides, or anything else that I should be aware of?

Sourav Kannantha B

1 Answer


The fact that you can't put EOF in trailing context is occasionally annoying but there's almost always a workaround, usually based on using maximal munch match ordering to ensure that some pattern matches at EOF because any other match would be longer. (Remember that trailing context counts for length comparison even though it's not part of the final token).

Here's one example:

[0-9_]+/[.][0-9_]*[a-zA-Z]    return NUM;
[0-9_]+([.][0-9_]+)?          return NUM;
[0-9_]*[a-zA-Z][a-zA-Z0-9_]*  return ID;

Pattern one matches digits followed by a decimal point if the decimal point is in turn followed by something which could be an ID.

Pattern two matches any number, regardless of what (if anything) follows.

Pattern three matches an ID (at least one letter). (It has the same effect as your second pattern. I just shortened the first character class: the mandatory [a-zA-Z] can always be taken to be the first letter of the token, so the prefix before it only needs to cover digits and underscores.)
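
If you want to convince yourself that the two ID patterns accept the same strings, a brute-force comparison is enough. This is only a sanity check written with Python's `re` (not flex), over a small alphabet I picked for the purpose; `disagreements` is a hypothetical helper name.

```python
import re
from itertools import product

original = re.compile(r"[a-zA-Z0-9_]*[a-zA-Z][a-zA-Z0-9_]*")
shortened = re.compile(r"[0-9_]*[a-zA-Z][a-zA-Z0-9_]*")


def disagreements(alphabet="aZ09_.", max_len=5):
    """Return every string up to max_len on which the two patterns differ."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # fullmatch asks whether the whole string is in the pattern's language.
            if bool(original.fullmatch(s)) != bool(shortened.fullmatch(s)):
                bad.append(s)
    return bad


if __name__ == "__main__":
    print(disagreements())
```

An empty list means the two patterns agree on every string tried. Since they accept the same strings, the longest match flex finds at any input position is the same for both.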

We count on maximal munch to keep pattern two from matching prematurely. Numbers without a decimal point followed by a letter will have a longer match with pattern three; numbers with a decimal point followed by a letter will have a longer match with pattern one, because its trailing context counts towards the match length. All that's left are numbers not followed by a letter; for those, pattern two will apply.
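
To see the match-length reasoning in action, here is a rough emulation of flex's rule selection in Python. It is a sketch under stated assumptions, not what flex generates: `compile_rule` and `tokenize` are names invented here, the OPER rule is a hypothetical stand-in for the rules elided in the question, and Python's greedy matching is assumed to find the longest match, which holds for these particular patterns but not for regular expressions in general.

```python
import re


def compile_rule(token, head, trail=""):
    # Group 1 is the token text; group 2 is the trailing context (may be empty).
    return token, re.compile(f"({head})({trail})")


RULES = [
    compile_rule("NUM", r"[0-9_]+", r"[.][0-9_]*[a-zA-Z]"),  # pattern one
    compile_rule("NUM", r"[0-9_]+(?:[.][0-9_]+)?"),          # pattern two
    compile_rule("ID", r"[0-9_]*[a-zA-Z][a-zA-Z0-9_]*"),     # pattern three
    compile_rule("OPER", r"[.]"),  # hypothetical rule for a lone '.'
]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None  # (length including trailing context, token, token text)
        for token, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule on a tie, as flex does.
            if m and (best is None or m.end() - pos > best[0]):
                best = (m.end() - pos, token, m.group(1))
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best[1], best[2]))
        # Only the token text is consumed; the trailing context is scanned again.
        pos += len(best[2])
    return tokens


if __name__ == "__main__":
    for text in ["123.456ab", "123.456", "123ab", "123"]:
        print(text, tokenize(text))
```

With these rules, "123.456ab" comes out as NUM 123, OPER . and ID 456ab, while "123.456" and "123" at end of input are each a single NUM, and "123ab" is a single ID.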

rici
  • Thanks. This answers my question perfectly. But as a quick follow-up, is a lexer generated by `Re/Flex` less performant than one generated by `Flex`? – Sourav Kannantha B May 14 '22 at 20:43
  • @sourav: I don't know. The question is so unspecific that it's impossible to answer with confidence. Both programs can generate more than one lexer, and lexers can analyse more than one input. Flex, in particular, has many tuning options (too many, imho), some of which are better with certain rule sets and certain inputs; others have different sweet spots. Careful and informed benchmarking might let you say which is better for your particular problem. But I've never attempted that, because it's far from the most important criterion for me. – rici May 14 '22 at 23:29
  • What's important? To start with, quality of documentation. Can I even find the documentation? (For a surprising number of parser generator tools, the answer is "no"). If I sit down and read the docs, will I feel confident that I can use the tool? (Not just write lexical rules, although obviously that's important, but also integrate it into my app and into my build system). – rici May 14 '22 at 23:39
  • Then, there's the feature set. Does it have the features I actually need? Does it have features which I could live without but might make my life easier? Does it generate lexers for the language I'm writing the rest of the tool in? – rici May 14 '22 at 23:42
  • How mature is it? Has it been around long enough and been used enough that it's likely that bugs have been fixed? Is the development team responsible, responsive, and aware of the importance of stable interfaces? ... Etc – rici May 14 '22 at 23:47
  • Any tool which passes all those tests is probably fast enough for a parsing application. But if I try it and it seems to be way too slow, I'll probably work down my list. – rici May 14 '22 at 23:50
  • @Sourav: in short, my advice: find a tool you're comfortable with and learn how to use it well. Then maybe try a different tool, so you can compare them in your problem space. But always put it into perspective: unless you have good evidence to the contrary, don't assume that small performance differences between different lexers will have a noticeable effect on your application. – rici May 14 '22 at 23:55