Branching at the parser level based on the content of a token

Question

I'm working on a simple example parser/lexer for a tiny project, but I've run into a problem.

I'm parsing content along these lines:

Name SEP Gender SEP Birthday
Name SEP Gender SEP Birthday

… where SEP is any one (but not multiple!) of |, ,, or whitespace.

Now, I didn't want to lock the field order in at the lexer level, so I'm trying to lex this with a very simple set of tokens:

%token <string> SEP
%token <string> VAL
%token NL

%token EOF

Now, it's dictated that I produce a parse error if, for instance, the gender field doesn't contain one of a small set of predetermined values, say {male,female,neither,unspecified}. I can wrap the parser and deal with this, but I'd really like to encode this requirement into the automaton for future expansion.

My first attempt, looking something like this, failed horribly:

doc:
   | EOF              { [] }
   | it = rev_records { it }
   ;

rev_records:
           | (* base-case: empty *) { [] }
           | rest = rev_records; record; NL  { record :: rest }
           | rest = rev_records; record; EOF { record :: rest }
           ;

record:
   last_name = name_field; SEP; first_name = name_field; SEP;
   gender = gender_field; SEP; favourite_colour = colour_field; SEP;
   birthday = date_field
   { {last_name; first_name; gender; favourite_colour; birthday} }

name_field: str = VAL { str }

gender_field:
            | VAL "male" { Person.Male }
            | VAL "female" { Person.Female }
            | VAL "neither" { Person.Neither }
            | VAL "unspecified" { Person.Unspecified }
            ;

Yeah, no dice. Obviously, my attempt at unstructured lexing is already going poorly.

What's the idiomatic way to parse something like this?

I'm not an expert at parsers, but I would have tokenized the accepted values since they have syntactic value, and then defined `gender_field` as a union of these tokens. — Richard-Degenne, Jul 18 '18 at 07:58

ivg · Answer 1 · 2018-07-19T14:04:11.870

Parser generators such as Menhir and ocamlyacc operate on tokens, not on strings or characters. The transformation from characters to tokens happens at the lexer level. That's why you can't match against a string's contents in a production rule.

You can, of course, perform any check in the semantic action and raise an exception, e.g.,

record:
   last_name = name_field; SEP; first_name = name_field; SEP;
   gender_val = VAL; SEP; favourite_colour = colour_field; SEP;
   birthday = date_field
   { 
     let gender = match gender_val with
     | "male" -> Person.Male
     | "female" -> Person.Female
     | "neither" -> Person.Neither
     | "unspecified" -> Person.Unspecified
     | _ -> failwith "Parser error: invalid value in the gender field" in
      {last_name; first_name; gender; favourite_colour; birthday}   
    }
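
The match in the semantic action above can also be factored into a plain OCaml function that the action calls, which keeps the grammar short and lets the check be tested on its own. A minimal sketch, assuming a `Person` module with the four constructors used in the question (its real definition isn't shown there) and a hypothetical helper name `gender_of_string`:

```ocaml
(* Assumed stand-in for the question's Person module; the real one is not shown. *)
module Person = struct
  type gender = Male | Female | Neither | Unspecified
end

(* Hypothetical helper: map the text carried by a VAL token to a gender,
   failing with the same kind of message as the semantic action above. *)
let gender_of_string (s : string) : Person.gender =
  match s with
  | "male" -> Person.Male
  | "female" -> Person.Female
  | "neither" -> Person.Neither
  | "unspecified" -> Person.Unspecified
  | other ->
    failwith
      (Printf.sprintf
         "Parser error: invalid value %S in the gender field" other)
```

With something like this in scope, the question's `gender_field` rule could probably be written as `gender_field: v = VAL { gender_of_string v }`, so the grammar keeps its original shape while the check still runs during parsing.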

You can also tokenize the possible genders, or use regular expressions at the lexer level to reject invalid fields, e.g.,

rule token = parse
| ("male" | "female" | "neither" | "unspecified") as s { GENDER s }
...

However, this is not recommended, as it will, in fact, turn male, female, etc. into keywords, so their occurrences in other places (a name field, for instance) will break your grammar.
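
To make the keyword problem concrete, here is a small sketch in plain OCaml that imitates what such a lexer rule would do. The `token` type and the `classify` function are hypothetical stand-ins for the generated lexer, not code from the question:

```ocaml
(* Hypothetical token type: the question's tokens plus a GENDER token. *)
type token =
  | SEP of string
  | VAL of string
  | GENDER of string
  | NL
  | EOF

(* Stand-in for the lexer rule: it sees one word at a time and has no
   idea which field of the record that word belongs to. *)
let classify (word : string) : token =
  match word with
  | "male" | "female" | "neither" | "unspecified" -> GENDER word
  | _ -> VAL word

(* The values of a record whose last name happens to be "male". *)
let fields = List.map classify [ "male"; "john"; "male" ]
(* fields = [GENDER "male"; VAL "john"; GENDER "male"] *)
```

The first field comes out as `GENDER` even though the grammar expects a `VAL` in the name position, so a perfectly valid record would be rejected.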

Maybe you should add the constructors like `Person.Male` that are used in the question. — PatJ, Jul 19 '18 at 13:34
Also, handling the gender at lexing time is not that ideal, those words could appear elsewhere and break stuff. — PatJ, Jul 19 '18 at 13:36
Yep, that's a good point. It basically makes the specified genders keywords, so it shouldn't be handled in the lexer in the first place. — ivg, Jul 19 '18 at 13:59
