How to code nextToken() function for a descent recursive parser LL(1)

Question

I'm writing a recursive descent LL(1) parser in C++, but I have a problem: I don't know exactly how to get the next token. I know I have to use regular expressions to recognize a terminal, but I don't know how to get the longest possible next token.

For example, take this lexer specification and this grammar (without left recursion, left factoring, or cycles):

    //LEXICAL IN FLEX

    TIME [0-9]+
    DIRECTION UR|DR|DL|UL|U|D|L|R
    ACTION A|J|M

    %%

    {TIME}      {printf("TIME"); return (TIME);}
    {DIRECTION} {printf("DIRECTION"); return (DIRECTION);}
    {ACTION}    {printf("ACTION"); return (ACTION);}
    "~"         {printf("RELEASED"); return (RELEASED);}
    "+"         {printf("PLUS_OP"); return (PLUS_OP);}
    "*"         {printf("COMB_OP"); return (COMB_OP);}

    //GRAMMAR IN BISON

    command : list_move PLUS_OP list_action
            | list_move COMB_OP list_action
            | list_move list_action
            | list_move
            | list_action
            ;

    list_move:  move list_move_prm
                ;

    list_move_prm:  move
                  | move list_move_prm
                  | ";"
                  ;
          
    list_action:  ACTION list_action_prm
                  ;

    list_action_prm:  PLUS_OP ACTION list_action_prm
                    | COMB_OP ACTION list_action_prm
                    | ACTION list_action_prm
                    | ";" //epsilon
                    ;

    move: TIME RELEASED DIRECTION
        | RELEASED DIRECTION
        | DIRECTION
        ;

I have a string that contains "D DR R + A". The parser should accept it, but I have a problem when reading "DR": because "D" is a token too, I don't know how to get "DR" instead of "D".

If you're using flex, you can have just `int nextToken() { return yylex(); }`. If you're asking how to implement what flex does, that's a much larger question — Chris Dodd, Aug 04 '19 at 19:18
The idea is to use neither Flex nor Bison; I want to build my own recursive descent parser. The examples I wrote in Flex and Bison are only there to explain the problem. — Briko, Aug 04 '19 at 19:36

Chris Dodd · Answer 1 · 2019-08-04T20:03:48.510

There are a number of ways of hand-writing a tokenizer:

You can use a recursive descent LL(1) parser "all the way down": rewrite your grammar in terms of single characters rather than tokens, and left factor it. Then your nextToken() routine becomes just getchar(). You'll end up with additional rules like:
```
TIME: DIGIT more_digits ;
more_digits: /* epsilon */ | DIGIT more_digits ;
DIGIT: '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R' ;
dir_suffix: /* epsilon */ | 'L' | 'R' ;
```
You can use regexes. Generally this means keeping around a buffer and reading the input into it. nextToken() then runs a series of regexes on the buffer, figuring out which one returns the longest token and returns that, advancing the buffer as needed.
You can do what flex does -- this is the buffer approach above, combined with building a single DFA that evaluates all of the regexes simultaneously. Running this DFA on the buffer then returns the longest token (based on the last accepting state reached before getting an error).
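
As a rough illustration of the regex approach, here is a minimal sketch that keeps the whole input in a `std::string` and tries one `std::regex` per terminal at the current position, keeping the longest match. The names `Tok`, `Token`, and `nextToken(input, pos)` are hypothetical, and skipping whitespace between tokens is an assumption, not something the question specifies. Note that ECMAScript alternation picks the first alternative that matches, so the two-letter directions are listed before the one-letter ones.

```cpp
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

enum class Tok { Time, Direction, Action, Released, PlusOp, CombOp, End, Error };

struct Token {
    Tok kind;
    std::string text;
};

// Hypothetical helper: returns the longest token starting at `pos`
// and advances `pos` past it.
Token nextToken(const std::string& input, std::size_t& pos) {
    // One regex per terminal, mirroring the flex rules in the question.
    static const std::vector<std::pair<Tok, std::regex>> rules = {
        {Tok::Time,      std::regex("[0-9]+")},
        {Tok::Direction, std::regex("UR|DR|DL|UL|U|D|L|R")},
        {Tok::Action,    std::regex("A|J|M")},
        {Tok::Released,  std::regex("~")},
        {Tok::PlusOp,    std::regex("\\+")},
        {Tok::CombOp,    std::regex("\\*")},
    };

    // Assumption: whitespace only separates tokens and is skipped here.
    while (pos < input.size() &&
           std::isspace(static_cast<unsigned char>(input[pos]))) {
        ++pos;
    }
    if (pos == input.size()) return {Tok::End, ""};

    Token best{Tok::Error, ""};
    const auto start = input.cbegin() + static_cast<std::ptrdiff_t>(pos);
    for (const auto& rule : rules) {
        std::smatch m;
        // match_continuous forces the match to begin exactly at `start`.
        if (std::regex_search(start, input.cend(), m, rule.second,
                              std::regex_constants::match_continuous) &&
            static_cast<std::size_t>(m.length(0)) > best.text.size()) {
            best = {rule.first, m.str(0)};
        }
    }

    if (best.kind == Tok::Error) {
        best.text = input.substr(pos, 1);  // report the offending character
    }
    pos += best.text.size();
    return best;
}
```

Calling `nextToken` repeatedly on "D DR R + A" with `pos` starting at 0 yields the DIRECTION tokens "D", "DR", and "R", then PLUS_OP, then ACTION "A", then `Tok::End`. Running every regex at every position is simple but slow, which is the cost the single-DFA approach avoids.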

Note that in all cases, you'll need to consider how to handle whitespace as well. You can just ignore whitespace everywhere (FORTRAN style) or you can allow whitespace between some tokens, but not others (eg, not between the digits of TIME or within a DIRECTION, but everywhere else in the grammar). This can make the grammar much more complex (and the process of hand-writing the recursive descent parser much more tedious).
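
To make the character-level option and the whitespace question concrete, here is a small sketch of the `TIME`, `DIRECTION`, and `move` rules written as recursive descent functions over single characters. `CharStream`, `skipSpaces`, `parseTime`, `parseDirection`, and `parseMove` are hypothetical names, and the whitespace policy (allowed between tokens, never inside `TIME` or `DIRECTION`) is just the example policy mentioned above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical character source: one character of lookahead over a string.
struct CharStream {
    std::string text;
    std::size_t pos;

    explicit CharStream(std::string t) : text(std::move(t)), pos(0) {}

    // Current character, or '\0' at end of input.
    char peek() const { return pos < text.size() ? text[pos] : '\0'; }
    void advance() { if (pos < text.size()) ++pos; }
};

// Whitespace is skipped only where the grammar calls this explicitly.
void skipSpaces(CharStream& in) {
    while (std::isspace(static_cast<unsigned char>(in.peek()))) in.advance();
}

// TIME: DIGIT more_digits   (the tail recursion is written as a loop)
bool parseTime(CharStream& in) {
    if (!std::isdigit(static_cast<unsigned char>(in.peek()))) return false;
    while (std::isdigit(static_cast<unsigned char>(in.peek()))) in.advance();
    return true;
}

// dir_suffix: /* epsilon */ | 'L' | 'R'
void parseDirSuffix(CharStream& in) {
    if (in.peek() == 'L' || in.peek() == 'R') in.advance();
}

// DIRECTION: 'U' dir_suffix | 'D' dir_suffix | 'L' | 'R'
bool parseDirection(CharStream& in) {
    switch (in.peek()) {
    case 'U': case 'D':
        in.advance();
        parseDirSuffix(in);  // no skipSpaces here: "D R" is two directions
        return true;
    case 'L': case 'R':
        in.advance();
        return true;
    default:
        return false;
    }
}

// move: TIME '~' DIRECTION | '~' DIRECTION | DIRECTION
bool parseMove(CharStream& in) {
    skipSpaces(in);
    if (parseTime(in)) {
        skipSpaces(in);
        if (in.peek() != '~') return false;  // TIME must be followed by '~'
    }
    if (in.peek() == '~') {
        in.advance();
        skipSpaces(in);
    }
    return parseDirection(in);
}
```

Because `parseDirection` never skips whitespace between the first letter and its suffix, "DR" is consumed as one direction while "D R" is consumed as two separate moves. The longest-match behaviour falls out of the left-factored grammar, with no separate tokenizer.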

That's useful information. I will try it. Thanks so much for your time. — Briko, Aug 04 '19 at 19:54

score 0 · Answer 2 · answered Aug 06 '19 at 12:03

“I don't know exactly how to get the next token”

Your input comes from a stream (std::istream). You must write a get_token(istream) function (or a tokenizer class). The function must first discard whitespace, then read a character (or more if necessary), analyze it, and return the associated token. The following functions will help you achieve your goal:

ws – discards white-space.
istream::get – reads a character.
istream::putback – puts back in the stream a character (think “undo get”).

"I don't know how to get "DR" instead "D""

Both "D" and "DR" are words. Just read them as you would read a word: is >> word. This reads up to the next whitespace, so it works when tokens are separated by spaces, as in "D DR R + A". You will also need a keyword-to-token map (see std::map). If you read the "D" string, you can ask the map what the associated token is. If the word is not found, throw an exception.

A starting point (run it):

    #include <iostream>
    #include <iomanip>
    #include <map>
    #include <string>

    enum token_t
    {
      END,
      PLUS,
      NUMBER,
      D,
      DR,
      R,
      A,

      // ...
    };

    // ...

    using keyword_to_token_t = std::map < std::string, token_t >;

    keyword_to_token_t kwtt =
    {
      {"A", A},
      {"D", D},
      {"R", R},
      {"DR", DR}

      // ...

    };

    // ...

    std::string s; // text of the last keyword read
    int n;         // value of the last number read

    // ...

    token_t get_token( std::istream& is )
    {
      char c;

      std::ws( is ); // discard white-space

      if ( !is.get( c ) ) // read a character
        return END; // failed to read or eof

      // analyze the character
      switch ( c )
      {
      case '+': // simple token
        return PLUS;

      case '0': case '1': case '2': case '3': case '4':
      case '5': case '6': case '7': case '8': case '9':
        is.putback( c ); // it starts with a digit: it must be a number, so put it back
        is >> n; // and let the library do the hard work
        return NUMBER;
        //...

      default: // keyword
      {
        is.putback( c );
        is >> s; // reads up to the next white-space
        auto it = kwtt.find( s ); // look the word up once
        if ( it == kwtt.end() )
          throw "keyword not found";
        return it->second;
      }
      }
    }

    int main()
    {
      try
      {
        while ( get_token( std::cin ) != END ) // stop at end of input
          ;
        std::cout << "valid tokens";
      }
      catch ( const char* e )
      {
        std::cout << e;
      }
    }
