
I'm working on defining a new language with lex and yacc. The lexer works fine but the parser doesn't. I thought the problem was that the grammar isn't recognizing the tokens, but after a lot of research and trial and error I'm stuck. I'm not sure the grammar is completely right. I get "Syntax error in input!" from the parser even though there is no syntax error in the data.


The data input is:

F 100

And here is another example for this grammar:

L 36 [L 4 [F 100 R 90] R 10]

My lexer (lexing.py) code:

import lex

tokens = (
    'NUMBER',
    'RED',
    'GREEN',
    'BLUE',
    'BLACK',
    'FORW',
    'RIGHT',
    'LOOP',
    'COLOR',
    'PEN',
    'LSQB',
    'RSQB',
    'EMPTY'
) 

t_FORW   = r'F'
t_RIGHT  = r'R'
t_LOOP   = r'L'
t_COLOR  = r'COLOR'
t_PEN    = r'PEN'
t_LSQB   = r'\['
t_RSQB   = r'\]'
t_RED    = r'K'
t_GREEN  = r'Y'
t_BLUE   = r'M'
t_BLACK  = r'S'
t_EMPTY  = r'\ '

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)    
    return t 

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value) 

t_ignore  = ' \t' 

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1) 

lexer = lex.lex()


data = '''
F 100
''' 

lexer.input(data)
 
for tok in lexer:
    print(tok)

And here is the parsing code:

import yacc
from lexing import tokens


def p_root(p):
    '''root : function NUMBER option
            | COLOR colors option
            | PEN NUMBER option '''
    
def p_option(p):
    '''option : root
              | LSQB root RSQB root
              | EMPTY '''
def p_function(p):
    '''function : FORW 
                | RIGHT 
                | LOOP '''
    
def p_colors(p):
    '''colors : RED 
              | BLUE 
              | GREEN 
              | BLACK ''' 
              
def p_error(p):
    print("Syntax error in input!")

from lexing import data

# Build the parser

parser=yacc.yacc()
result=parser.parse(data)
#print (result)

I tried every way I know. As you can see, I haven't written the p-function bodies (the p[0] assignments) yet, but writing those is not the solution. What could be the real problem?

John Kugelman
  • You've imported a tuple of strings from `lexing`. Where do you tell the parser to use them, let alone the functions that actually produce the tokens they represent? – chepner May 20 '22 at 10:55
  • Actually I didn't pay attention to that before, but I guess that's the problem. So... I don't know how to tell the parser to use them. I thought it was the yacc file that does this job, but apparently not. I tried printing the parser result; here is the code and output: **parser=yacc.yacc()** **result=parser.parse(data)** **print (result)** **F** **100** **Syntax error in input!** **None** –  May 20 '22 at 12:52
  • Is it normal for it to print **None**? This is my first attempt at parsing and I can't find a solution, since I can't even figure out what the problem is. I'm literally stuck at this point. Here is an interesting situation: I tried a basic grammar before and it worked flawlessly, and I didn't tell the parser to do anything (no p[0] assignments). How did that work? –  May 20 '22 at 13:03
  • I've never used either module before, but `lexing.data` is just a string. It looks like `lexer.input` takes a string and makes `lexer` an iterable over the tokens; maybe pass `lexing.lexer` itself to `parser.parse`, instead of `lexing.data`? (I assume the parser wants a stream of tokens, not a stream of individual characters.) – chepner May 20 '22 at 13:53
  • @chepner: passing the string to the parse is correct. The lexer creates a generator to make it easier to debug. – rici May 20 '22 at 14:49
  • I did not quite understand what you mean, but yes, what I want to do is make the grammar recognize the tokens. You can find the lex and yacc files that I imported [here](https://github.com/dabeaz/ply/tree/master/ply) –  May 20 '22 at 14:51
  • Euler: i don't believe your lexer produces the correct token stream. Please show the debugging output which convinced you that it is working as expected. – rici May 20 '22 at 14:55
  • data is _L 36 [L 4 [F 100 R 90] R 10]_ and debugging output is: (**[ F 100 R 90 ] [ F 100 R90 ] R 10 ]**) It is listing the tokens and printing separately but I can't show that here. Wouldn't that indicate the lexer is working correctly? –  May 20 '22 at 16:15
  • @rici sorry i forgot to tag you –  May 20 '22 at 16:30
  • @euler: Except for not producing the `EMPTY` token that your grammar required, yes. Sorry, I misremembered how `t_ignore` works in Ply; it does, in fact, override patterns. So `EMPTY` couldn't match even if you'd written a space at the end of the input (which you didn't). But, anyway, the correct solution, which I'm glad to see you arrived at, is to make `EMPTY` (or `empty`) a non-terminal which matches the empty string rather than a terminal which matches a space. Or just get rid of it altogether, since you can just write an empty production for any nonterminal. – rici May 21 '22 at 04:15
  • @rici Thanks again. I already fixed that, but there is another problem. Can you help me? –  May 21 '22 at 11:20

2 Answers


The immediate problem is your use of EMPTY as a token (a single space character). That definition conflicts with your list of ignored characters in t_ignore, which is also going to be a problem. But note that there is no space character at the end of your input (the input ends with a newline, which is ignored), and your grammar requires option to end with EMPTY. That's guaranteed to produce a syntax error. (In the first version of this answer, I said that t_ignore is overridden by explicit token patterns, but it turns out that I was wrong. You can use an ignored character inside a rule, but not at the beginning; tokens which start with an ignored character will never be matched.)

Particularly if this is your first project, you should follow a more systematic debugging technique. First ensure that the input is tokenised in the way you expect it to be, without trying to parse the token stream. When you do start parsing the stream, make sure that any syntax errors report the token which produced the error, including its location in the input.
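As a concrete illustration of that first step, here is a standalone token-stream check using only the `re` module (not PLY; the rules are a hand-copied subset of the question's lexer, so treat it as a sketch):

```python
import re

# Standalone sketch of (a subset of) the question's token rules,
# useful for seeing what the lexer *should* emit before parsing.
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('FORW',   r'F'),
    ('RIGHT',  r'R'),
    ('LOOP',   r'L'),
    ('LSQB',   r'\['),
    ('RSQB',   r'\]'),
    ('SKIP',   r'[ \t\n]+'),   # whitespace, like t_ignore
]
MASTER = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(data):
    for m in MASTER.finditer(data):
        kind, value = m.lastgroup, m.group()
        if kind == 'SKIP':
            continue            # no EMPTY token is ever produced for spaces
        if kind == 'NUMBER':
            value = int(value)
        yield (kind, value)

print(list(tokenize('F 100')))  # [('FORW', 'F'), ('NUMBER', 100)]
```

Note that no EMPTY token ever appears for the spaces, which is exactly why a grammar that requires one has to fail.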

Both the lexer and the parser can be called with the keyword argument debug=True, which will provide a debugging log of parser actions. That should definitely be part of your debugging.

All that said, your grammar strikes me as not very useful for determining the structure of the input. A good grammar reads the same way as you would describe the input in your own language, which might include descriptions like:

  • the input is a list of commands.
  • A command could be a forward command, a right command, ..., or a loop command.
  • A forward command is an F followed by a number.
  • A loop command is an L followed by a number followed by a list of commands inside [ and ].
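Rendered in the BNF notation used elsewhere in this thread, that description might look like this (rule names are illustrative):

```
<program> ::= <command> <program> | ε
<command> ::= <forward> | <right> | <loop> | <color> | <pen>
<forward> ::= F <number>
<right>   ::= R <number>
<loop>    ::= L <number> [ <program> ]
```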
rici
  • Thanks for this descriptive answer. As I understand it, I should remove the 'EMPTY' token, but how can I define `empty` in the grammar code? And as you said, I followed a systematic debugging technique as much as I could: when I run the lexer, it tokenizes the input without any problem. So I should also pass `debug=True` when building the parser. I'll rewrite the grammar with your suggestions in mind. One last question: I understand that the grammar rules must be something like: `grammar::= | | | ... ::= F color::= COLOR ` –  May 20 '22 at 15:53
  • I have to define the grammar with recursive and iterative rules, but how can I define **ε** without using the EMPTY token? I actually need to ignore spaces; if I don't, how can the lexer tokenize the data? –  May 20 '22 at 16:04
  • I solved it thanks to you. Your answer was really helpful. If you want to see what I did, you can check my answer. –  May 20 '22 at 18:29

I finally solved it. The code was fine; the real problems were me and my absurd grammar.

The previous grammar was:

<root> ::= <function> <numbers> <option> | COLOR <colors> <option> | PEN <numbers> <option>
<option>::= <root> | [ <root> ] <root> | ε
<function>::= F | R | L
<colors>::= K | M | Y | S

New grammar is:

<grammar> ::= <function> | <function> <grammar> | ε 
<function> ::= <forward> | <right> | <loop> | <color> | <pen>
<forward> ::= F <numbers> 
<right> ::= R <numbers>
<loop> ::= L <numbers> <lbracket> <grammar> <rbracket>
<color> ::= COLOR <colors>
<pen> ::= PEN <numbers>
<colors> ::= M | K | S | Y
<lbracket> ::= [
<rbracket> ::= ]

And I deleted the EMPTY token and defined a new rule instead: an empty production (a rule with nothing on its right-hand side) plays the role of ε. Here is the grammar-defining part of the code:

def p_start(p):
    '''start : function 
             | function option'''
def p_function(p):
    '''function : forward 
                | right 
                | loop
                | color
                | pen'''
def p_empty(p):
    'empty :'
    pass
def p_option(p):
    '''option : start 
              | empty '''
def p_forward(p):
    'forward : FORW NUMBER'
def p_right(p):
    'right : RIGHT NUMBER'
def p_loop(p):
    'loop : LOOP NUMBER LSQB start RSQB'
def p_color(p):
    'color : COLOR colors'
def p_colors(p):
    '''colors : BLACK 
              | BLUE
              | GREEN
              | RED '''
def p_pen(p):
    'pen : PEN NUMBER'
    
def p_error(p):
    print("Syntax error in input!")
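For anyone who hits the `parse()` returns **None** issue from the comments above: even once the grammar accepts the input, `parse()` only returns something if the p-functions assign to `p[0]`. A minimal self-contained sketch (assuming the pip-installed `ply` package rather than local `lex.py`/`yacc.py` files, and covering only the F/R/L subset of the language):

```python
from ply import lex, yacc

tokens = ('NUMBER', 'FORW', 'RIGHT', 'LOOP', 'LSQB', 'RSQB')

t_FORW   = r'F'
t_RIGHT  = r'R'
t_LOOP   = r'L'
t_LSQB   = r'\['
t_RSQB   = r'\]'
t_ignore = ' \t\n'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

def p_program(p):
    '''program : command program
               | command'''
    # Build a list of commands; p[0] is what parse() ultimately returns.
    p[0] = [p[1]] + (p[2] if len(p) == 3 else [])

def p_command_move(p):
    '''command : FORW NUMBER
               | RIGHT NUMBER'''
    p[0] = (p[1], p[2])          # e.g. ('F', 100)

def p_command_loop(p):
    'command : LOOP NUMBER LSQB program RSQB'
    p[0] = ('LOOP', p[2], p[4])  # loop count and nested command list

def p_error(p):
    print("Syntax error at", p)

parser = yacc.yacc(write_tables=False, debug=False)
print(parser.parse('L 2 [F 100 R 90]'))
# [('LOOP', 2, [('F', 100), ('R', 90)])]
```

With `p[0]` filled in at every rule, the parse result is a small tree you can walk to actually execute the turtle commands, instead of `None`.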