I'm trying to build a parser with Jison (a node.js implementation of Bison) to parse a file that looks like this:
---
Redirect Test Patterns
---
one.html /two/
one/two.html /three/four/
one /two
one/two/ /three
one/two/ /three/four
one/two /three/four/five/
one/two.html http://three.four.com/
one/two/index.html http://three.example.com/four/
one http://two.example.com/three
one/two.pdf https://example.com
one/two?query=string /three/four/
go.example.com https://example.com
The goal
This is a file that stores redirection paths/URLs. There are other scripts that refer to this file when they need to know how to redirect a user. The goal is to develop a parser that I can run every time someone attempts to save the file. That way, I can make sure it's always formatted properly.
Basically, everything inside the --- block is to be ignored, as well as any empty lines. Each of the remaining lines represents a "redirection record".
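To make the "ignore" rule concrete, here is a rough sketch in plain JavaScript (not Jison) of the filtering I have in mind. The function name recordLines is made up for illustration, and it assumes a --- marker always sits on a line of its own.

```javascript
// Hypothetical helper: returns only the lines that should be treated as
// redirection records. Assumes each "---" marker is on its own line.
function recordLines(text) {
  const records = [];
  let inIgnoredBlock = false;
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "---") {
      // A "---" line opens or closes the ignored block.
      inIgnoredBlock = !inIgnoredBlock;
      continue;
    }
    // Skip everything inside the block, plus empty lines anywhere.
    if (inIgnoredBlock || line.trim() === "") continue;
    records.push(line);
  }
  return records;
}

const sample = [
  "---",
  "Redirect Test Patterns",
  "---",
  "one.html /two/",
  "",
  "one /two",
].join("\n");

// Logs the two record lines and nothing else.
console.log(recordLines(sample));
```

The lexer would need to express the same idea with start conditions instead of a boolean flag.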
Each "redirection record" must have the following structure:
INPUT_URL_OR_PATH <space> OUTPUT_URL_OR_PATH
In other words, exactly one space must separate the two strings.
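As a sanity check on that rule, a single record could be validated with a regular expression. This is only a sketch: the names RECORD_PATTERN and parseRecord are hypothetical, and it assumes neither string may contain whitespace.

```javascript
// Assumption: INPUT_URL_OR_PATH and OUTPUT_URL_OR_PATH are non-empty runs of
// non-whitespace characters, separated by exactly one space.
const RECORD_PATTERN = /^(\S+) (\S+)$/;

// Hypothetical helper: returns the two parts of a record, or null if the
// line does not have the required shape.
function parseRecord(line) {
  const match = RECORD_PATTERN.exec(line);
  return match ? { input: match[1], output: match[2] } : null;
}

console.log(parseRecord("one/two.html http://three.four.com/"));
console.log(parseRecord("one  /two")); // two spaces, so this is rejected (null)
```

A line with a missing second string, extra spaces, or trailing whitespace comes back as null.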
What I have done so far
I am very new to grammars/parsing, so please bear with me.
The language grammar I have sketched out looks like this:
file -> lines EOF
lines -> record
lines -> lines record
record -> INPATH SPACE OUTPATH
The terminal symbols include: EOF, INPATH, SPACE, OUTPATH.
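To check that I understand my own grammar sketch, here is how I would recognize it by hand in plain JavaScript, given a list of token type names. The function matchesGrammar is hypothetical and is not how Jison works internally. It only illustrates which token sequences the four rules accept.

```javascript
// Checks a sequence of token type names against the sketched grammar:
//   file   -> lines EOF
//   lines  -> record | lines record
//   record -> INPATH SPACE OUTPATH
function matchesGrammar(tokenTypes) {
  let pos = 0;
  let recordCount = 0;
  // "lines" means one or more records; in a hand-written recognizer the
  // recursive rule turns into a simple loop.
  while (
    tokenTypes[pos] === "INPATH" &&
    tokenTypes[pos + 1] === "SPACE" &&
    tokenTypes[pos + 2] === "OUTPATH"
  ) {
    pos += 3;
    recordCount += 1;
  }
  // At least one record, followed by EOF as the very last token.
  return (
    recordCount > 0 &&
    tokenTypes[pos] === "EOF" &&
    pos === tokenTypes.length - 1
  );
}

console.log(matchesGrammar(["INPATH", "SPACE", "OUTPATH", "EOF"])); // true
console.log(matchesGrammar(["INPATH", "SPACE", "EOF"])); // false
```

Note that this sketch, like the four rules above, rejects a file with no records at all.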
Unfortunately, I am not even at the point where I can implement that yet because I am having trouble developing my lexer.
This is what my jison file looks like:
/* description: Parses a list of redirects */
/* lexical grammar */
%lex
%x comment
%%
"---" this.begin("comment")
<comment>"---" this.popState()
<comment>[\n] /* skip new lines */
<comment>. /* skip all characters */
[ \t\n] /* do nothing */
(\w+) return 'WORD'
<<EOF>> return 'EOF'
. /* do nothing */
/lex
/* operator associations and precedence */
/* n/a */
%start file
%% /* language grammar */
file
    : lines EOF
        { console.log($1); return $1; }
    | EOF
        { const msg = 'The target file is empty';
          console.log(msg);
          return msg; }
    ;
lines
    : lines WORD
        { console.log('WORD ', $2) }
    | WORD
        { console.log('WORD ', $1) }
    ;
Clearly, I am very far from being done. I am currently stuck on several things all at the same time.
Things I'm stuck on
- Being able to skip empty lines;
- Tokenizing INPATH, SPACE, OUTPATH; and
- Using left-recursion in the language grammar section as opposed to right-recursion (What's the difference? Am I even doing it right? What's the best option here?).
In other words, I have no idea what I'm doing and could really use some help.
EDIT: I'm going to do more research and will hopefully be able to answer my own question eventually.