Lex parsing to determine

Question

This is an extension to a previous question. I'm trying to parse a .txt file and determine if each line is valid or invalid depending on my rules. The text files will contain an assortment of random strings, hex, integers and decimals separated by a single space, such as:

5 -0xA98F 0XA98H text hello 2.3 -12 0xabc

I'm trying to identify valid hex, integers and decimals and get an output like so.

5 valid
-0xA98F valid
0xA98H invalid
text invalid
hello invalid
2.3 valid
-12 valid
0xabc invalid

My current code however displays like so:

5 valid
-0xa98f valid
0xA98 valid  <--- issue 1: just removes the H
2.3 valid <--- ignores text and hello
-12 valid
0xabc invalid

Here is the code I currently have:

%{
#include <iostream>
using namespace std;
%}
Decimal [+-]?[0-9]+\.[0-9]+?
Integers [+-]?[0-9]+
Hex [-]?[0][xX][0-9A-F]+ 

%%

[ \t\n] ;

{Decimal} {cout << yytext << "Valid" << endl; }
{Integers} { cout << yytext << "Valid" << endl; } 
{Hex} {cout << yytext << " Valid" << endl;}
. ;
%%

main() {

FILE *myfile= fopen("something.txt", "r");
if (!myfile) {
    cout << "Error" << endl;
    return -1;
}

yyin = myfile;

yylex();
fclose(yyin);

}

Are the "things" in your file one per line, as in your question, or different space-separated words, as in your example? — rici, Mar 05 '17 at 01:09
they are typically spaced out like my example. the output is the only thing that becomes one per line. — sippycup, Mar 05 '17 at 01:20
Please *edit your question* to make the requirements more precise. Comments don't count. What exactly is a "thing"? Could they be separated by commas, for example? Precision is *essential* when you are writing parsers. — rici, Mar 05 '17 at 01:24

rici · Answer 1 · 2017-03-05T01:55:16.910

The key to using flex for problems like this is understanding the "maximal-munch" rule. The rule is simple: Flex always picks the action corresponding to the pattern which matches the longest string (starting with the current input point; flex never "searches" for a match.) If more than one pattern matches the same longest substring, then the first pattern in the flex description is chosen. That means that the order of rules is important.

This is described at more length in the Flex manual section on How the Input is Matched.
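To make the rule concrete, here is a minimal C sketch that imitates how the longest match is chosen, with ties going to the earlier rule. It is not how flex is implemented (flex compiles the patterns into a single automaton); the matcher functions and `pick_rule` are hypothetical names written by hand for this illustration. They mirror the four word patterns used in the flex file further down, and the hex matcher assumes the sign is optional, as in the asker's own pattern.

```c
#include <ctype.h>
#include <stddef.h>

/* Each matcher returns the length of the longest prefix of s that its
   pattern matches, or 0 if the pattern does not match at s. */

static size_t skip_sign(const char *s) {
    return (*s == '+' || *s == '-') ? 1 : 0;
}

/* [+-]?[[:digit:]]+ */
static size_t match_integer(const char *s) {
    size_t i = skip_sign(s), start = i;
    while (isdigit((unsigned char)s[i])) i++;
    return i > start ? i : 0;
}

/* [+-]?[[:digit:]]+"."[[:digit:]]*  or  [+-]?"."[[:digit:]]+ */
static size_t match_decimal(const char *s) {
    size_t i = skip_sign(s), start = i;
    while (isdigit((unsigned char)s[i])) i++;
    size_t int_digits = i - start;
    if (s[i] != '.') return 0;
    i++;
    size_t frac_start = i;
    while (isdigit((unsigned char)s[i])) i++;
    if (int_digits == 0 && i == frac_start) return 0; /* a lone "." */
    return i;
}

/* [+-]?0[xX][[:xdigit:]]+  (optional sign is an assumption) */
static size_t match_hex(const char *s) {
    size_t i = skip_sign(s);
    if (s[i] != '0' || (s[i + 1] != 'x' && s[i + 1] != 'X')) return 0;
    i += 2;
    size_t start = i;
    while (isxdigit((unsigned char)s[i])) i++;
    return i > start ? i : 0;
}

/* [^[:space:]]+ */
static size_t match_word(const char *s) {
    size_t i = 0;
    while (s[i] != '\0' && !isspace((unsigned char)s[i])) i++;
    return i;
}

typedef size_t (*matcher)(const char *);

/* Rule order matters only when two rules match the same length. */
static const matcher rules[] = {
    match_integer, match_decimal, match_hex, match_word
};
enum { N_RULES = sizeof rules / sizeof rules[0] };

/* Returns the index of the winning rule (or -1 if none matches) and
   stores the length of its match in *len. */
int pick_rule(const char *s, size_t *len) {
    int best = -1;
    *len = 0;
    for (int r = 0; r < N_RULES; r++) {
        size_t n = rules[r](s);
        if (n > *len) {       /* strict '>' keeps the earlier rule on ties */
            *len = n;
            best = r;
        }
    }
    return best;
}
```

For the input `0xA98H`, the integer matcher reports 1 character, the hex matcher 5 and the catch-all word matcher 6, so the catch-all wins and the whole word is treated as invalid. For `-0xA98F` the hex and word matchers both report 7 characters, and the hex rule wins only because it comes first.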

So let's suppose that you are interested in matching complete words, where "words" are non-empty sequences of arbitrary non-whitespace characters separated by whitespace. (So, for example, under the rules below, the line 3, 4 and 5. would contain only two valid strings, 4 and 5.; the word 3, is invalid because of its trailing comma.)

It's easy to identify the four possibilities:

Decimal integers
Decimal floating point
Hexadecimal integers
Any other word.

We also need to ignore whitespace, other than recognizing it as a word separator.

If we put the rules in that order, we can be confident that the correct rule will be chosen for each word, because of the maximal munch rule.

So here's the entire flex file (except for the definition of main):

%option noinput nounput noyywrap nodefault
%%
[[:space:]]+           { /* Ignore whitespace */ }
[+-]?[[:digit:]]+      { printf("%s valid\n", yytext); /* Decimal integer */ }
[+-]?[[:digit:]]+"."[[:digit:]]* {
                         printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?"."[[:digit:]]+   { printf("%s valid\n", yytext); /* Decimal point */ }
[+-]?0[xX][[:xdigit:]]+ { printf("%s valid\n", yytext); /* Hexadecimal integer, sign optional */ }
[^[:space:]]+          { printf("%s invalid\n", yytext); /* Any word not matched by above rules */ }

Notes

I've used ordinary printf statements here. You're free to use C++ streams, of course, but I prefer to use either stdio.h or iostreams, but not both. It might be considered cleaner to #include <stdio.h>, but in fact Flex already does that because it needs it for its own purposes.
The %option statement tells flex that you don't need yywrap (which means you don't need to provide one or link with -lfl), that you don't use input or unput (which means you can compile with -Wall without getting unused function warnings) and that you don't expect flex to need to insert a default rule (which saves you from embarrassing errors, because flex will warn you if there is anything which might not match any rule.)
I used [[:xdigit:]]+ in the hexadecimal pattern, which allows both upper and lower-case hex digits. If that's not desired, you could replace it with [0-9A-F] as in your original code, but your examples seem to indicate that your original code was not correct. Of course, you could write out the POSIX character classes, but I find them more readable. See the Flex manual section on Patterns for a complete list.
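If lower-case hex digits should be rejected, as the asker's expected output suggests, the hexadecimal pattern could be narrowed to `[+-]?0[xX][0-9A-F]+`. The C function below is a hand-written sketch of what that narrower pattern would accept when applied to a whole word; `is_upper_hex` is a hypothetical helper for illustration and not part of flex or of the answer's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the whole word matches
   [+-]?0[xX][0-9A-F]+  (upper-case hex digits only). */
bool is_upper_hex(const char *word) {
    if (*word == '+' || *word == '-') word++;          /* optional sign */
    if (word[0] != '0' || (word[1] != 'x' && word[1] != 'X'))
        return false;                                  /* needs 0x or 0X */
    word += 2;
    size_t n = strspn(word, "0123456789ABCDEF");       /* run of allowed digits */
    return n > 0 && word[n] == '\0';                   /* at least one, nothing after */
}
```

With this check, `-0xA98F` is accepted while `0xabc` and `0xA98H` are rejected, which matches the valid/invalid list in the question.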

I provided more detail like you suggested if that would help you. — sippycup, Mar 05 '17 at 01:41
@sippycup: OK, I took out the comment about imprecision. (Although it's still not clear to me whether you intended to accept upper- and lower-case hex digits, I changed the pattern based on your example of `0xabc` being valid.) — rici, Mar 05 '17 at 01:50
I want lowercase hex digits to be invalid in the working code. — sippycup, Mar 05 '17 at 02:16
