How to create a regex without certain group of letters in lex

Question

I've recently started learning lex, so for practice I decided to write a program that recognises the declaration of a simple variable (sort of).

This is my code:

%{
#include "stdio.h"
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorrect");
%%
int main(){
yylex();
}

Some inputs for which the output should be ok:

int var3
int _varR3
int _AA3_

But if I enter int float as input, it also prints ok, which is wrong because both words are reserved.

So my question is: what should I modify so that my expression rejects the dataType words after the space?

Thank you.

score 2 · Answer 1 · edited Nov 18 '17 at 04:08

A preliminary consideration: typically, the construction you describe is detected in the parsing phase, not the lexing phase. With yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.

To achieve that with lex/flex alone, though, you could consider playing around with negated character classes ([^...]) and the trailing context operator (/). Or...

If you're running flex, perhaps simply surrounding each of your regexes with parentheses and passing the -l flag would do the trick. Note that there are a few differences between lex and flex, as described in the Flex manual.

rici · Accepted Answer · 2015-12-07T03:41:13.577

This is really not the way to solve this particular problem.

The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.
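
As a rough illustration of that division of labour, here is a hand-written C sketch. It is not flex output: the names next_token, is_declaration and the TOK_* codes are made up for this example, and in a real project flex would generate the tokenizer from the pattern rules while yacc/bison would do the checking.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; a flex scanner would return similar
   values from its rule actions. */
enum token { TOK_END, TOK_TYPE, TOK_ID, TOK_BAD };

/* Tokenizer: skip whitespace, read one word starting at *p,
   advance *p past it, and say whether it is a keyword or an identifier. */
enum token next_token(const char **p)
{
    static const char *const types[] = { "int", "float", "char", "String" };
    const char *s = *p;
    size_t len = 0, i;

    while (isspace((unsigned char)*s))
        s++;
    if (*s == '\0') {
        *p = s;
        return TOK_END;
    }
    if (!isalpha((unsigned char)*s) && *s != '_') {
        *p = s + 1;              /* consume the offending character */
        return TOK_BAD;
    }
    while (isalnum((unsigned char)s[len]) || s[len] == '_')
        len++;
    *p = s + len;

    /* A word is a keyword only if it equals a type name exactly. */
    for (i = 0; i < sizeof types / sizeof types[0]; i++)
        if (strlen(types[i]) == len && strncmp(s, types[i], len) == 0)
            return TOK_TYPE;
    return TOK_ID;
}

/* "Parser": a declaration is exactly one TYPE token followed by one ID token. */
int is_declaration(const char *line)
{
    const char *p = line;
    return next_token(&p) == TOK_TYPE
        && next_token(&p) == TOK_ID
        && next_token(&p) == TOK_END;
}
```

With this split, is_declaration("int var3") is true, while is_declaration("int float") is false because the second word comes back as a TYPE token rather than an ID token. No pattern ever has to say "an identifier that is not a keyword".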

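However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens: the longest match wins, and when two patterns match the same amount of text, the rule that appears first in the file is chosen.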

For example, you could do something like this:

dataType       int|float|char|String
id             [[:alpha:]_][[:alnum:]_]*

%%

{dataType}[[:space:]]+{dataType}   { puts("Error: two types"); }
{dataType}[[:space:]]+{id}         { puts("Valid declaration"); }

  /* ...  more rules ... */

The above uses POSIX character classes instead of writing out the possible characters. See man isalpha for a list of POSIX character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.
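
To see why the order of the two rules matters, the following C sketch models the matching rule by hand. It is only a model of the documented behaviour (longest match, ties go to the earlier rule), not how flex is implemented, and match_id, match_type, match_rule, classify and the enum values are names invented for the illustration.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum rule { NO_MATCH, RULE_TWO_TYPES, RULE_DECL };

static const char *const TYPES[] = { "int", "float", "char", "String" };
#define NTYPES (sizeof TYPES / sizeof TYPES[0])

/* Length of an identifier at s, or 0 if s does not start one. */
size_t match_id(const char *s)
{
    size_t n = 0;
    if (!isalpha((unsigned char)s[0]) && s[0] != '_')
        return 0;
    while (isalnum((unsigned char)s[n]) || s[n] == '_')
        n++;
    return n;
}

/* Length of the type keyword that is a prefix of s, or 0.
   No keyword here is a prefix of another, so at most one can match. */
size_t match_type(const char *s)
{
    size_t best = 0, i;
    for (i = 0; i < NTYPES; i++) {
        size_t len = strlen(TYPES[i]);
        if (len > best && strncmp(s, TYPES[i], len) == 0)
            best = len;
    }
    return best;
}

/* Length matched by "type, whitespace+, X" at s, where X is a type
   (second_is_type != 0) or an identifier; 0 if the rule does not match. */
size_t match_rule(const char *s, int second_is_type)
{
    size_t head = match_type(s), ws = 0, tail;
    if (head == 0)
        return 0;
    while (isspace((unsigned char)s[head + ws]))
        ws++;
    if (ws == 0)
        return 0;
    tail = second_is_type ? match_type(s + head + ws)
                          : match_id(s + head + ws);
    return tail ? head + ws + tail : 0;
}

/* Pick the rule the way the precedence rule describes:
   longest match first, and on a tie the rule listed first (two types). */
enum rule classify(const char *s)
{
    size_t two_types = match_rule(s, 1);
    size_t decl = match_rule(s, 0);
    if (two_types == 0 && decl == 0)
        return NO_MATCH;
    return decl > two_types ? RULE_DECL : RULE_TWO_TYPES;
}
```

For int float both rules match all nine characters, so the tie goes to the two-types rule because it is listed first. For int floaty the identifier rule matches one character more, so it wins and the input is accepted as a declaration. Swapping the order of the two rules would make int float count as valid again.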
