
I'm attempting to use Bison to develop my own programming language. I've got the .y file written for my grammar. However, I'm wondering if there's a way, in the case that the user attempts to parse source code with invalid grammar, to have Bison give a useful error message. For example, suppose I have the following rule in my grammar:

if_statement: IF expr '{' statement_list '}' {$$=createNode(IF,$2,$4);}
    ;

Suppose the source code left out the closing brace. According to my understanding, Bison would report that it was unable to find a rule to reduce the code. Could Bison be made to recognize that there is an unfinished if which begins on line such-and-such and report that to the user?

Daniel Walker

1 Answer


Missing braces are very rarely detected where they happen, because it is usually the case that whatever follows the missing brace could just as well have come before it. That's particularly clear if the missing close brace is immediately followed by another closing brace, but it could simply be followed (in this case) by another statement:

function some_function() {
    ....
    while (some_condition) {
        ...
        if (some_other_condition) {
            ...
            break;
//      }          /* Commented out by mistake */
        a = 3;
        ...
    }
    return a;
}

function another_function() {
    ...
}

If your language doesn't allow nested function definitions then the definition of another_function will trigger an error; if it does allow nested function definitions, then another_function will just be defined in an unexpected scope and the parse will continue, perhaps until the end of file.

One way of detecting errors like this is to compare the indentation of every line with the expected indentation. However, unless your language has some concept of correct indentation (as Python does, for example), you cannot flag misleading indentation as an error. So the best you can do is record the unexpected indentation and use it as a clue when a syntax error is finally encountered (if there is one at all, since the programmer might simply not care to make their programs human-readable). The complications in this approach to error detection are probably why it is so uncommon in mainstream languages, although personally I think it's an approach with a lot of potential.
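As a rough illustration of recording that clue, here is a minimal C sketch. It assumes space-only indentation, and it ignores braces inside string literals and block comments. `IndentTracker`, `indent_tracker_line` and `MAX_DEPTH` are hypothetical names invented for this example, not part of Flex or Bison. The scanner would call the function once per source line, and the result would only be consulted after a syntax error has been reported.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64  /* hypothetical limit on tracked nesting */

/* Where an unclosed '{' was seen and how far that line was indented. */
typedef struct { int line; int indent; } OpenBlock;

typedef struct {
    OpenBlock open[MAX_DEPTH];
    int depth;           /* number of currently unclosed '{' */
    int suspect_line;    /* first oddly indented line, or 0 if none */
    int suspect_opener;  /* line of the '{' that may lack its '}' */
} IndentTracker;

/* Count leading spaces (assumption: the source is indented with spaces). */
static int leading_spaces(const char *text) {
    int n = 0;
    while (text[n] == ' ') n++;
    return n;
}

/* Feed one source line; lineno is its 1-based line number. */
void indent_tracker_line(IndentTracker *t, int lineno, const char *text) {
    assert(t != NULL && text != NULL);
    int indent = leading_spaces(text);
    const char *p = text + indent;

    /* Blank lines and whole-line // comments carry no information. */
    if (*p == '\0' || *p == '\n' || (p[0] == '/' && p[1] == '/')) return;

    /* A statement indented no deeper than the line that opened the
       innermost block hints that the block's '}' is missing. */
    if (*p != '}' && t->depth > 0 && t->depth <= MAX_DEPTH &&
        t->suspect_line == 0 && indent <= t->open[t->depth - 1].indent) {
        t->suspect_line = lineno;
        t->suspect_opener = t->open[t->depth - 1].line;
    }

    /* Update the brace stack, stopping at a trailing // comment. */
    for (; *p != '\0'; p++) {
        if (p[0] == '/' && p[1] == '/') break;
        if (*p == '{') {
            if (t->depth < MAX_DEPTH) {
                t->open[t->depth].line = lineno;
                t->open[t->depth].indent = indent;
            }
            t->depth++;
        } else if (*p == '}' && t->depth > 0) {
            t->depth--;
        }
    }
}
```

Applied to the example above, this would flag the `a = 3;` line and point back at the line containing `if (some_other_condition) {`. It is only a guess, so it is better phrased as a hint attached to the real syntax error than reported as an error in its own right.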

I usually advocate parsing erroneous programs twice. The first parse is optimised for correct programs, which means that it doesn't need any of the overhead required for good error messages, such as tracking the position of every token. If the program turns out to be syntactically correct, you can then move on to turning the AST into compiled code. If the program turns out to have a syntax error, you can restart the parse at the beginning, and then you are free to use heuristics like indentation checks to better localise errors.

Having said all that, you may well do better to move on to implementation of your language and return to the problem of producing better diagnostics later.

Bison does offer a mechanism for producing more useful error messages in some cases.

First, you should at least enable line number tracking from Flex, which is almost zero effort. You might also want to track precise token position, which is a bit more work but not too much. (See "Character Position from starting of a line", https://stackoverflow.com/a/48879103/1566221, and "yyllocp->first_line returns uninitialized value in second iteration of a reEntrant Bison parser", among others, for sample code.)
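A commonly used way to do both in a Flex scanner is sketched below. It assumes that the Bison grammar declares `%locations` (so that `yylloc` exists), that no token spans more than one line, and that the generated header is called `parser.tab.h`. The header name and the `yycolumn` variable are placeholders for this example; `IF` is the token from the question.

```lex
%option noyywrap yylineno

%{
#include "parser.tab.h"  /* assumed name of the header bison generates with -d */

static int yycolumn = 1;

/* Runs before every rule action: stamp the token's position into yylloc. */
#define YY_USER_ACTION                                  \
    yylloc.first_line = yylloc.last_line = yylineno;    \
    yylloc.first_column = yycolumn;                     \
    yylloc.last_column = yycolumn + yyleng - 1;         \
    yycolumn += yyleng;
%}

%%

\n          { yycolumn = 1; }
[ \t]+      { /* skip whitespace; the column was already advanced */ }
"if"        { return IF; }

%%
```

Note that a tab counts as a single column here, which may or may not match what an editor displays.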

Second, ask bison to produce verbose error messages. That only requires two extra lines in the declarations section of your bison file:

%define parse.error verbose
%define parse.lac full

Please do read the bison manual for some important caveats. In particular, LAC may involve significant overhead. But the error messages produced are often helpful.
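To make the position show up in those messages, the two declarations can be combined with `%locations` and a `yyerror` that prints `yylloc`. The following is a sketch for a default (non-pure) parser and Bison 3.0 or later. It assumes the scanner actually fills in `yylloc`; otherwise the reported position is meaningless.

```yacc
%locations
%define parse.error verbose
%define parse.lac full

%code {
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg);
}

%%
/* grammar rules go here */
%%

/* Called by the parser with the verbose message, for example one that
   lists the tokens that were expected at the point of the error. */
void yyerror(const char *msg) {
    fprintf(stderr, "%d:%d: %s\n",
            yylloc.first_line, yylloc.first_column, msg);
}
```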

Finally, use bison's error recovery mechanism to continue the parse after the first syntax error is detected, thus allowing you to report several syntax errors in a single run. That's usually less frustrating for a user, although you should terminate the parse at some threshold error count, because really high error counts after error recovery usually mean that the error recovery itself failed and that many of the subsequent error messages were bogus.
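A minimal sketch of that idea follows. It assumes statements in your language end with `;`. The `statement` nonterminal and the `MAX_SYNTAX_ERRORS` threshold are made up for this example, while `error`, `yyerrok`, `yynerrs` and `YYABORT` are standard Bison facilities.

```yacc
%{
#define MAX_SYNTAX_ERRORS 20   /* hypothetical threshold */
%}

%%

statement
    : if_statement
    | error ';'
        {
            /* The offending tokens up to the next ';' were discarded.
               Set $$ here to whatever placeholder your AST uses. */
            yyerrok;                 /* resume normal error reporting */
            if (yynerrs >= MAX_SYNTAX_ERRORS)
                YYABORT;             /* too many errors: give up */
        }
    ;
```

Resynchronising on `;` is only one possible choice; which token makes a good synchronisation point depends on the language.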

Again, the bison manual has some useful suggestions about how to use the error facilities.
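For the unterminated `if` in the question, one thing to experiment with is an `error` alternative on the rule itself. The fragment below is only a sketch. It needs `%locations`, a scanner that fills in `yylloc`, and `<stdio.h>` included in the prologue. The action may also run for syntax errors inside the block that have nothing to do with a missing brace, which is why the message is worded as a guess.

```yacc
if_statement
    : IF expr '{' statement_list '}'
        { $$ = createNode(IF, $2, $4); }
    | IF expr '{' statement_list error
        {
            /* @1 is the location of the IF token. */
            fprintf(stderr,
                    "line %d: 'if' block opened here may be missing its '}'\n",
                    @1.first_line);
            $$ = createNode(IF, $2, $4);
        }
    ;
```

Adding `error` alternatives can introduce conflicts with other error rules in the grammar, so check bison's report after each change.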

Bison manual table of contents

rici
  • My language, like C et al, ignores whitespace. So, I had in mind the situation where EOF is reached with incomplete reduction of the code. Could Bison tell me, "I was trying to parse an if block when EOF was reached"? – Daniel Walker Jun 27 '20 at 01:19
  • @DanielWalker: If you enable verbose error messages then you might see an error saying that an `else` or a `}` was expected. Bison makes a list of possible tokens, and if the list is not too long it includes them in the error message. If you want the error message to say that an `if` was not terminated, you can put an `error` production alternative to the `if-stmt` rule. But you'd want to add such productions to other blocks as well. Experimentation is useful. Reading the manual is essential. – rici Jun 27 '20 at 01:56
  • Also I understood that your language ignores whitespace. My claim is that by looking at whitespace, a compiler may be able to work out what the user's *intention* was, which could allow it to make a better guess about the nature of the error. All syntax error messages are guesses; that is important to understand, both as a compiler writer and as a user. The compiler writer needs to make the best guess possible and the compiler user needs to interpret the guess as a guess :-) – rici Jun 27 '20 at 01:58