Although there are ways to solve your problem as stated, I think you would be better off understanding the intended use of (f)lex and finding a solution consistent with its processing model.
(F)lex is intended to split an input into individual tokens. Each token has a type, and it is expected that the type of a token can be determined simply by looking at it (and not at its context). The classic examples of token types are the lexical elements of a computer program: identifiers, numbers, certain keywords, and various operators. Given an appropriate set of rules, a (f)lex scanner will take an input like
a = b*7 + 2;
and produce a stream of tokens:
identifier = identifier * number + number ;
Each of these tokens has an associated "semantic value" (which not all of them actually require), so that the two identifier tokens and the two number tokens are not just anonymous blobs.
Note that a and b in the above line have different roles: a is being assigned to, while b is being referred to. But that's not relevant to their form, and it is not evident from their form. They are just tokens. Figuring out what they mean and their relationship with each other is the role of a parser, which is a separate part of the parsing model. The intention of the two-phase scan/parse paradigm is to simplify both tasks by abstracting away complications: the scanner knows nothing about context or meaning, while the parser can deduce the logical structure of the input without concerning itself with the messy details of representation and irrelevant whitespace.
In many ways, your problem is a bit outside of this paradigm, in part because the two token types you have cannot be distinguished on the basis of their appearance alone. If they have no useful internal structure, though, then you could just accept that your input consists of
- "paths", which do not contain whitespace, and
- newline characters.
You could then use a combination of a lexer and a parser to break the input into lines:
File splitter.l
%{
#include <string.h>   /* for strdup */
#include "splitter.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
File splitter.y
%code {
#include <stdio.h>
#include <stdlib.h>
int yylex(void);
void yyerror(const char* msg);
}
%code requires {
#define YYSTYPE char*
}
%token PATH
%%
lines: %empty
     | lines line '\n'
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
%%
void yyerror(const char* msg) {
    fprintf(stderr, "%s\n", msg);
}
int main(void) {
    return yyparse();
}
Quite a lot of the above is boilerplate; it's worth concentrating just on the grammar and the token patterns.
The grammar is very simple:
lines: %empty
     | lines line '\n'
line : %empty
     | PATH PATH { printf("Map '%s' to '%s'\n", $1, $2);
                   free($1); free($2);
                 }
The interesting line is the last one, which says that a line consists of two PATHs. Its action handles each line by printing it out, although you'd probably want to do something different. It is this line which understands that the first word on a line and the second word on the same line have different functions. Note that it doesn't need the lexer to label the two words as "FIRST" and "SECOND", since it can see that all by itself :)
The two calls to free release the memory allocated by strdup in the lexer, thus avoiding a memory leak. In a real application, you'd need to make sure you don't free the strings until you no longer need them.
The lexer patterns are also very simple:
\n { return '\n'; }
[^[:space:]]+ { yylval = strdup(yytext); return PATH; }
[[:space:]] /* Ignore whitespace other than newlines */
The first one returns a special single-character token, a newline character, to stand for the end-of-line token. The second one matches any string of non-whitespace characters. ((F)lex doesn't know about GNU regex extensions, so it doesn't have \s and friends. It does, however, have the much more readable POSIX character classes, which are listed in the flex manual, among other places.) The third pattern skips any whitespace. A lone \n is matched equally well by the first and third patterns, and when two patterns match the same amount of text (f)lex picks the one that appears first, so the newline never reaches the third pattern. That is also why the third pattern is a single whitespace character and not a repetition: a repetition could produce a longer match that swallows a newline, and the longest match always wins.
In the second pattern, we assign a value to yylval, which is the semantic value of the token. (We don't do this elsewhere because the newline token doesn't need a semantic value.) yylval always has type YYSTYPE, which we have arranged to be char* by a #define. Here, we just set it from yytext, which is the string of characters (f)lex has just matched. It is important to make a copy of this string because yytext is part of the lexer's internal structure, and its value will change without warning. Having made a copy of the string, we are then obliged to ensure that the memory is eventually released.
To try this program out:
bison -o splitter.tab.c -d splitter.y
flex -o splitter.lex.c splitter.l
gcc -Wall -O2 -o splitter splitter.tab.c splitter.lex.c