I'm working on a simple example parser/lexer for a tiny project, but I've run into a problem.
I'm parsing content along these lines:
Name SEP Gender SEP Birthday
Name SEP Gender SEP Birthday
… where SEP is any one (but not multiple!) of a pipe (|), a comma (,), or whitespace.
Now, I didn't want to lock the field order in at the lexer level, so I'm trying to lex this with a very simple set of tokens:
%token <string> SEP
%token <string> VAL
%token NL
%token EOF
Now, it's dictated that I produce a parse error if, for instance, the gender field doesn't contain one of a small set of predetermined values, say {male,female,neither,unspecified}. I can wrap the parser and deal with this, but I'd really like to encode this requirement into the automaton for future expansion.
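For concreteness, the wrap-the-parser fallback could look roughly like the sketch below. This is only an illustration under assumptions: the `gender` type is a stand-in for whatever the Person module defines, and `gender_of_string`, `check_gender`, and `Invalid_field` are hypothetical names, not part of the project or of any library.

```ocaml
(* Hypothetical stand-in for the Person module's gender type. *)
type gender = Male | Female | Neither | Unspecified

(* Hypothetical error raised by the wrapper when a field is rejected. *)
exception Invalid_field of string

(* Map a raw VAL string to a gender, or None if it is not an allowed value. *)
let gender_of_string (s : string) : gender option =
  match s with
  | "male" -> Some Male
  | "female" -> Some Female
  | "neither" -> Some Neither
  | "unspecified" -> Some Unspecified
  | _ -> None

(* Wrapper-side check: fail on any gender string outside the allowed set. *)
let check_gender (s : string) : gender =
  match gender_of_string s with
  | Some g -> g
  | None -> raise (Invalid_field ("gender: " ^ s))
```

A check like this runs after the grammar has already accepted the token stream, which is exactly why it does not satisfy the goal of having the automaton itself reject bad values.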
My first attempt, looking something like this, failed horribly:
doc:
| EOF { [] }
| it = rev_records { it }
;
rev_records:
| (* base-case: empty *) { [] }
| rest = rev_records; record; NL { record :: rest }
| rest = rev_records; record; EOF { record :: rest }
;
record:
last_name = name_field; SEP; first_name = name_field; SEP;
gender = gender_field; SEP; favourite_colour = colour_field; SEP;
birthday = date_field
{ {last_name; first_name; gender; favourite_colour; birthday} }
name_field: str = VAL { str }
gender_field:
| VAL "male" { Person.Male }
| VAL "female" { Person.Female }
| VAL "neither" { Person.Neither }
| VAL "unspecified" { Person.Unspecified }
;
Yeah, no dice. Obviously, my attempt at unstructured lexing is already going poorly.
What's the idiomatic way to parse something like this?