
I'm trying to write a lexer to parse a file that looks like this:

one.html /two/
one/two/ /three
three/four http://five.com

Each line has two strings separated by a space. I need to create two regex patterns: one to match the first string, and another to match the second string.

This is my attempt at the regex for the lexer (a file named lexer.l to be run by flex):

%%
(\S+)(?:\s+\S+)   { printf("FIRST %s\n", yytext); }
(?:\S+\s+)(\S+)   { printf("SECOND %s\n", yytext); }
.                 { printf("Mystery character %s\n", yytext); }
%%

I have tested both (\S+)(?:\s+\S+) and (?:\S+\s+)(\S+) in the Regex101 tester and they both seem to be working properly: https://regex101.com/r/FQTO15/1

However, when I try to build the lexer by running flex lexer.l, I get a warning:

lexer.l:3: warning, rule cannot be matched

This refers to the second rule. If I reverse the order of the rules, the warning again points at whichever rule comes second. If I leave in only one of the rules, it works perfectly fine.

I believe this is because both regexes match exactly the same text, so flex considers the second rule unreachable, even though the two regexes are meant to capture different parts of the line.

Is there anything I can do with the regex so that it will capture/match what I want without clashing with each other?

EDIT: More Test Examples

one.html /two/
one/two.html /three/four/
one /two
one/two/ /three
one_two/ /three
one%20two/ /three
one/two/ /three/four
one/two /three/four/five/
one/two.html http://three.four.com/
one/two/index.html http://three.example.com/four/
one http://two.example.com/three
one/two.pdf https://example.com
one/two?query=string /three/four/
go.example.com https://example.com

EDIT

It turns out that the regex engine used by flex is rather limited. It doesn't support Perl-style constructs such as capturing/non-capturing groups, and it doesn't recognize \s for whitespace.

So this wouldn't work:

^.*\s.*$

But this does:

^.*" ".*$
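Building on that, the two fields can also be distinguished by position rather than by grouping: anchor the first field at the start of the line with ^, and require the second to be preceded by blanks. A minimal, untested sketch along those lines (my own adaptation, not verified against every input above):

```lex
%%
^[^ \t\n]+       { printf("FIRST %s\n", yytext); }
[ \t]+[^ \t\n]+  { char *p = yytext;
                   while (*p == ' ' || *p == '\t') ++p;  /* skip the separator blanks */
                   printf("SECOND %s\n", p);
                 }
\n               { /* end of line */ }
%%
```

Because the first rule only matches at the beginning of a line and the second requires leading blanks, the two rules can no longer match the same text, so the "rule cannot be matched" warning should not arise.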

Thanks to @fossil for all their help.


1 Answer

Although there are ways to solve your problem as stated, I think you would be better off understanding the intended use of (f)lex, and finding a solution consistent with its processing model.

(F)lex is intended to split an input into individual tokens. Each token has a type, and it is expected that it is possible to figure out the type of a token simply by looking at it (and not at its context). The classic examples of token types are the objects in a computer program, where we have, for example, identifiers, numbers, certain keywords, and various operators. Given an appropriate set of rules, a (f)lex scanner will take an input like

a = b*7 + 2;

and produce a stream of tokens:

identifier = identifier * number + number ;

Each of these tokens has an associated "semantic value" (which not all of them actually require), so that the two identifier tokens and the two number tokens are not just anonymous blobs.

Note that a and b in the above line have different roles. a is being assigned to, while b is being referred to. But that's not relevant to their form, and it is not evident from their form. They are just tokens. Figuring out what they mean and their relationship with each other is the role of a parser, which is a separate part of the parsing model. The intention of the two-phase scan/parse paradigm is to simplify both tasks by abstracting away complications: the scanner knows nothing about context or meaning, while the parser can deduce the logical structure of the input without concerning itself with the messy details of representation and irrelevant whitespace.

In many ways, your problem is a bit outside of this paradigm, in part because the two token types you have cannot be distinguished on the basis of their appearance alone. If they have no useful internal structure, though, then you could just accept that your input consists of

  • "paths", which do not contain whitespace, and
  • newline characters.

You could then use a combination of a lexer and a parser to break the input into lines:

File splitter.l

%{
#include "splitter.tab.h"
%}
%option noinput nounput noyywrap nodefault
%%
\n             { return '\n'; }
[^[:space:]]+  { yylval = strdup(yytext); return PATH; }
[[:space:]]    /* Ignore whitespace other than newlines */

File splitter.y

%code { 
#include <stdio.h>
#include <stdlib.h>

int yylex();
void yyerror(const char* msg);
}

%code requires {
#define YYSTYPE char*
}

%token PATH

%%

lines: %empty
     | lines line '\n'

line : %empty
     | PATH PATH       { printf("Map '%s' to '%s'\n", $1, $2);
                         free($1); free($2);
                       }

%%
void yyerror(const char* msg) {
  fprintf(stderr, "%s\n", msg);
}

int main(int argc, char** argv) {
  return yyparse();
}

Quite a lot of the above is boiler-plate; it's worth concentrating just on the grammar and the token patterns.

The grammar is very simple:

lines: %empty
     | lines line '\n'

line : %empty
     | PATH PATH       { printf("Map '%s' to '%s'\n", $1, $2);
                         free($1); free($2);
                       }

The interesting line is the last one, which says that a line consists of two PATHs. That handles each line by printing it out, although you'd probably want to do something different. It is this line which understands that the first word on a line and the second word on the same line have different functions. Note that it doesn't need the lexer to label the two words as "FIRST" and "SECOND", since it can see that all by itself :)

The two calls to free release the memory allocated by strdup in the lexer, thus avoiding a memory leak. In a real application, you'd need to make sure you don't free the strings until you don't need them any more.

The lexer patterns are also very simple:

\n             { return '\n'; }
[^[:space:]]+  { yylval = strdup(yytext); return PATH; }
[[:space:]]    /* Ignore whitespace other than newlines */

The first one returns a special single-character token, the newline character, to serve as the end-of-line token. The second one matches any string of non-whitespace characters. ((F)lex doesn't know about GNU regex extensions, so it doesn't have \s and friends. It does, however, have the much more readable Posix character classes, which are listed in the flex manual, among other places.) The third pattern skips any whitespace. Since \n was already handled by the first pattern, it cannot be matched here (which is why this pattern is a single whitespace character and not a repetition).

In the second pattern, we assign a value to yylval, which is the semantic value of the token. (We don't do this elsewhere because the newline token doesn't need a semantic value.) yylval always has type YYSTYPE, which we have arranged to be char* by a #define. Here, we just set it from yytext, which is the string of characters (f)lex has just matched. It is important to make a copy of this string because yytext is part of the lexer's internal structure, and its value will change without warning. Having made a copy of the string, we are then obliged to ensure that the memory is eventually released.

To try this program out:

bison -o splitter.tab.c -d splitter.y
flex -o  splitter.lex.c splitter.l
gcc -Wall -O2 -o splitter splitter.tab.c splitter.lex.c 
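Assuming flex, bison and gcc are available and the build succeeds, a session with the question's sample input should look something like this (each output line is produced by the printf in the line rule's action):

```
$ printf 'one.html /two/\none/two/ /three\n' | ./splitter
Map 'one.html' to '/two/'
Map 'one/two/' to '/three'
```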
  • When you say that you should be able to "figure out the type of a token simply by looking at it (and not at its context)", is this what "context-free grammar" means? – adrianmcli Jan 23 '17 at 06:38
  • @adrianmc: No. The "context-free" in "context-free grammar" is a technical term with a very specific meaning related to the definition of CFGs, and it does not correspond with most people's intuitions about what "context-free" might mean. – rici Jan 23 '17 at 06:46