Force CL-Lex to read whole word

Question

I'm using CL-Lex to implement a lexer (as input for CL-YACC) and my language has several keywords such as "let" and "in". However, while the lexer recognizes such keywords, it does too much. When it finds words such as "init", it returns the first token as IN, while it should return a "CONST" token for the "init" word.

This is a simple version of the lexer:

(define-string-lexer lexer
     (...)
     ("in"   (return (values :in $@)))
     ("[a-z]([a-z]|[A-Z]|\_)"  (return (values :const $@))))

How do I force the lexer to fully read the whole word until some whitespace appears?

score 8 · Answer 1 · edited Nov 21 '12 at 04:56

This is both a correction of Kaz's errors, and a vote of confidence for the OP.

In his original response, Kaz states the order of Unix lex precedence exactly backward. From the lex documentation:

Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:

The longest match is preferred.

Among rules which matched the same number of characters, the rule given first is preferred.

In addition, Kaz is wrong to criticize the OP's solution of using Perl-regex word-boundary matching. As it happens, you are allowed (free of tormenting guilt) to match words in any way that your lexer generator will support. CL-LEX uses Perl regexes, which use \b as a convenient syntax for the more cumbersome lex approximate of :

%{
#include <stdio.h>
%}

WC      [A-Za-z']
NW      [^A-Za-z']

%start      INW NIW

{WC}  { BEGIN INW; REJECT; }
{NW}  { BEGIN NIW; REJECT; }

<INW>a { printf("'a' in wordn"); }
<NIW>a { printf("'a' not in wordn"); }

All things being equal, finding a way to unambiguously match his words is probably better than the alternative.

Despite Kaz wanting to slap him, the OP has answered his own question correctly, coming up with a solution that takes advantage of the flexibility of his chosen lexer generator.

score 1 · Answer 2 · answered Mar 28 '12 at 05:16

Your example lexer above has two rules, both of which match a sequence of exactly two characters. Moreover, they have common matches (the language matched by the second is a strict superset of the first).

In the classic Unix lex, if two rules both match the same length of input, precedence is given to the rule which occurs first in the specification. Otherwise, the longest possible match dominates.

(Although without RTFM, I can't say that that is what happens in CL-LEX, it does make a plausible hypothesis of what is happening in this case.)

It looks like you're missing a regex Kleene operator to match a longer token in the second rule.

I finally found a solution. I use the `\b` operator around the first pattern for `"in"`, like this `"\bin\b"`. This basically forces the lexer to read "in" as a word. — Flávio Cruz, Mar 28 '12 at 18:31

Force CL-Lex to read whole word

2 Answers2