11

I wonder how to implement indentation as block delimiters in bison + flex, just like in Python. I'm writing my own programming language (mostly for fun, but I intend to use it together with a game engine), and I'll try to come up with something special that minimizes boilerplate and maximizes dev speed.

I have already written a compiler (actually a `langToy` to NASM translator) in C, but it failed. For some reason it was only able to handle one string in the whole source file (well, I had been awake for more than 48 hours, so... you know, brain meltdown).

I don't know if curly brackets and/or begin -> end are easier to implement (I don't have a problem doing that) or if it's just my brain that locks up.

Thanks in advance!


Update: Okay, I have no clue how to do it with flex. I have problems returning multiple DEDENTs to the parser. Flex/Bison are relatively new to me.


Update 2: This is the flex file I've come up with so far; it doesn't quite work:

%x t
%option noyywrap

%{
  int lineno = 0, ntab = 0, ltab = 0, dedent = 0;
%}

%%

<*>\n  { ntab = 0; BEGIN(t); }
<t>\t  { ++ntab; }
<t>.   { int i; /* my compiler complains not c99 if i use for( int i=0... */
         if( ntab > ltab )
           printf("> indent >\n");
         else if( ntab < ltab )
           for( i = 0; i < ltab - ntab; i++ )
             printf("< dedent <\n");
         else
           printf("=        =\n");

         ltab = ntab; ntab = 0;
         BEGIN(INITIAL);
         /* move to next rule */
         REJECT;}
.    /* ignore everything else for now */

%%

main()
{
  yyin = fopen( "test", "r" );
  yylex();
}

You can play around with it; maybe you'll see what I'm missing. Returning multiple dedents would be easy in Haxe ( return t_dedent( num ); ).

This code doesn't always match the indents/dedents correctly.


Update 3: I think I will give up on flex and do it my own way. If anyone knows how to do it in flex, I would be happy to hear it anyway.

Gama11
Frank

4 Answers

16

What you need to do is have flex count the amount of whitespace at the beginning of every line and insert an appropriate number of INDENT/UNINDENT tokens for the parser to use to group things. One question is what you want to do about tabs vs. spaces: do you just want them to be equivalent with fixed tab stops, or do you want to require indenting to be consistent (so if one line begins with a tab and the next with a space, you signal an error, which is probably a little harder)?
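For the stricter option, one possible rule (an assumption made here for illustration, not something this answer prescribes) is that the leading whitespace of one line must be a character-for-character prefix of the other line's leading whitespace. A minimal sketch in plain C, outside of flex; `indents_consistent` is a hypothetical helper name:

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the leading whitespace of the shorter-indented line is
 * an exact prefix of the leading whitespace of the other line, else 0.
 * This is an assumed consistency rule, chosen for the sketch. */
int indents_consistent(const char *prev, const char *cur)
{
    size_t prev_len, cur_len, common;

    assert(prev != NULL && cur != NULL);

    prev_len = strspn(prev, " \t");   /* length of leading spaces/tabs */
    cur_len  = strspn(cur, " \t");
    common   = prev_len < cur_len ? prev_len : cur_len;

    return memcmp(prev, cur, common) == 0;
}
```

With this rule a line indented with a tab followed by a line indented with a space is rejected, while a tab followed by a tab plus two spaces is accepted as a deeper level.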

Assuming you want fixed 8-column tabstops, you can use something like

%{
/* globals to track current indentation */
int current_line_indent = 0;   /* indentation of the current line */
int indent_level = 0;          /* indentation level passed to the parser */
%}

%x indent /* start state for parsing the indentation */
%s normal /* normal start state for everything else */

%%
<indent>" "      { current_line_indent++; }
<indent>"\t"     { current_line_indent = (current_line_indent + 8) & ~7; }
<indent>"\n"     { current_line_indent = 0; /*ignoring blank line */ }
<indent>.        {
                   unput(*yytext);
                   if (current_line_indent > indent_level) {
                       indent_level++;
                       return INDENT;
                   } else if (current_line_indent < indent_level) {
                       indent_level--;
                       return UNINDENT;
                   } else {
                       BEGIN normal;
                   }
                 }

<normal>"\n"     { current_line_indent = 0; BEGIN indent; }
... other flex rules ...

You do have to make sure you start the parse in indent mode (to get the indentation on the first line).
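The tab rule above rounds the column up to the next tab stop, and the bit mask only does that correctly when the tab width is a power of two (such as 8 or 2). A minimal sketch of the same computation in plain C for any positive width; `next_tab_stop` and `indent_width` are hypothetical helper names, not part of flex:

```c
#include <assert.h>

/* Advance `column` to the next tab stop. For power-of-two widths this
 * gives the same result as (column + tab_width) & ~(tab_width - 1). */
int next_tab_stop(int column, int tab_width)
{
    assert(column >= 0 && tab_width > 0);
    return (column / tab_width + 1) * tab_width;
}

/* Width, in columns, of the leading run of spaces and tabs in `line`. */
int indent_width(const char *line, int tab_width)
{
    int column = 0;

    for (; *line == ' ' || *line == '\t'; ++line) {
        if (*line == '\t')
            column = next_tab_stop(column, tab_width);
        else
            column += 1;
    }
    return column;
}
```

A space followed by a tab therefore counts as one tab stop, not as nine columns, which matches what the `<indent>` rules compute one character at a time.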

Chris Dodd
  • Seems like you got it, but I want tab stops to count as 2 spaces. So I guess that the line should be current_line_indent = (current_line_indent + 2) & ~1; – Frank Oct 03 '09 at 17:39
  • Yes -- when you see a tab, you need to bump current_line_indent to the next tabstop. – Chris Dodd Oct 04 '09 at 05:37
6

Chris' answer goes a long way towards a usable solution, thanks a bunch for this! Unfortunately, it is missing a few more important aspects which I needed:

  • Multiple outdents (unindents) at once. The following code should emit two outdents after the call to baz:

    def foo():
      if bar:
        baz()
    
  • Emit outdents when the end of the file is reached while some indentation level is still open.

  • Indentation levels of different size. Chris' current code only works correctly for 1-space indents.
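The three requirements above can be sketched independently of flex as a small stack of indentation widths. This is only an illustration under stated assumptions: the `IndentStack` type, its function names and the fixed `MAX_DEPTH` limit are hypothetical and are not taken from the lexer in this answer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64            /* assumed nesting limit for this sketch */

typedef struct {
    int widths[MAX_DEPTH];      /* widths[0] is always 0 (top level) */
    size_t depth;               /* index of the innermost open level */
} IndentStack;

void indent_stack_init(IndentStack *s)
{
    s->widths[0] = 0;
    s->depth = 0;
}

/* Feed the indentation width of the next non-blank line.
 * Returns +1 for one INDENT, -n for n OUTDENTs, 0 for no change.
 * Sets *inconsistent to 1 when a dedent lands between two known levels. */
int indent_stack_feed(IndentStack *s, int width, int *inconsistent)
{
    int outdents = 0;

    *inconsistent = 0;
    assert(width >= 0);

    if (width > s->widths[s->depth]) {
        /* Deeper than before: open exactly one level of any size. */
        assert(s->depth + 1 < MAX_DEPTH);
        s->widths[++s->depth] = width;
        return 1;
    }
    /* Shallower: close levels until one is not wider than this line.
     * The loop stops at depth 0 because widths[0] == 0 <= width. */
    while (width < s->widths[s->depth]) {
        --s->depth;
        ++outdents;
    }
    if (width != s->widths[s->depth])
        *inconsistent = 1;
    return -outdents;
}

/* At end of input: number of OUTDENTs needed to close all open levels. */
int indent_stack_finish(IndentStack *s)
{
    int ignored;
    return -indent_stack_feed(s, 0, &ignored);
}
```

Feeding the widths 0, 2 and 4 for the `foo` example and then calling `indent_stack_finish` yields two OUTDENTs, regardless of how many spaces each level uses.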

Based on Chris' code, I came up with a solution which works in all the cases I have come across so far. I have created a template project for parsing indentation-based text using flex (and bison) on github: https://github.com/lucasb-eyer/flex-bison-indentation. It is a fully working (CMake-based) project which also tracks the line position and the column range of the current token.

Just in case the link should break for whatever reason, here is the meat of the lexer:

%{
#include <stack>

int g_current_line_indent = 0;
std::stack<size_t> g_indent_levels;
int g_is_fake_outdent_symbol = 0;

static const unsigned int TAB_WIDTH = 2;

#define YY_USER_INIT { \
    g_indent_levels.push(0); \
    BEGIN(initial); \
}
#include "parser.hh"

%}

%x initial
%x indent
%s normal

%%
    int indent_caller = normal;

 /* Everything runs in the <normal> mode and enters the <indent> mode
    when a newline symbol is encountered.
    There is no newline symbol before the first line, so we need to go
    into the <indent> mode by hand there.
 */
<initial>.  { set_yycolumn(yycolumn-1); indent_caller = normal; yyless(0); BEGIN(indent); }
<initial>\n { indent_caller = normal; yyless(0); BEGIN(indent); }    

<indent>" "     { g_current_line_indent++; }
<indent>\t      { g_current_line_indent = (g_current_line_indent + TAB_WIDTH) & ~(TAB_WIDTH-1); }
<indent>\n      { g_current_line_indent = 0; /* ignoring blank line */ }
<indent><<EOF>> {
                    // When encountering the end of file, we want to emit an
                    // outdent for all indents currently left.
                    if(g_indent_levels.top() != 0) {
                        g_indent_levels.pop();

                        // See the same code below (<indent>.) for a rationale.
                        if(g_current_line_indent != g_indent_levels.top()) {
                            unput('\n');
                            for(size_t i = 0 ; i < g_indent_levels.top() ; ++i) {
                                unput(' ');
                            }
                        } else {
                            BEGIN(indent_caller);
                        }

                        return TOK_OUTDENT;
                    } else {
                        yyterminate();
                    }
                }

<indent>.       {
                    if(!g_is_fake_outdent_symbol) {
                        unput(*yytext);
                    }
                    g_is_fake_outdent_symbol = 0;
                    // -2: -1 for putting it back and -1 for ending at the last space.
                    set_yycolumn(yycolumn-1);

                    // Indentation level has increased. It can only ever
                    // increase by one level at a time. Remember how many
                    // spaces this level has and emit an indentation token.
                    if(g_current_line_indent > g_indent_levels.top()) {
                        g_indent_levels.push(g_current_line_indent);
                        BEGIN(indent_caller);
                        return TOK_INDENT;
                    } else if(g_current_line_indent < g_indent_levels.top()) {
                        // Outdenting is the most difficult, as we might need to
                        // outdent multiple times at once, but flex doesn't allow
                        // emitting multiple tokens at once! So we fake this by
                        // 'unput'ting fake lines which will give us the next
                        // outdent.
                        g_indent_levels.pop();

                        if(g_current_line_indent != g_indent_levels.top()) {
                            // Unput the rest of the current line, including the newline.
                            // We want to keep it untouched.
                            for(size_t i = 0 ; i < g_current_line_indent ; ++i) {
                                unput(' ');
                            }
                            unput('\n');
                            // Now, insert a fake character indented just so
                            // that we get a correct outdent the next time.
                            unput('.');
                            // Though we need to remember that it's a fake one
                            // so we can ignore the symbol.
                            g_is_fake_outdent_symbol = 1;
                            for(size_t i = 0 ; i < g_indent_levels.top() ; ++i) {
                                unput(' ');
                            }
                            unput('\n');
                        } else {
                            BEGIN(indent_caller);
                        }

                        return TOK_OUTDENT;
                    } else {
                        // No change in indentation, not much to do here...
                        BEGIN(indent_caller);
                    }
                }

<normal>\n    { g_current_line_indent = 0; indent_caller = YY_START; BEGIN(indent); }
LucasB
  • My code produces an INDENT/UNINDENT for *every* space of indentation. So for your example with 2-space indents, it will produce two INDENT tokens after the first line, another 2 after the second, and 4 UNINDENT at the end. So you'll need to have your parser "ignore" extra redundant INDENT/UNINDENT pairs. Collapsing them in the lexer is hard if you want to catch trailing reduced indent properly, but if you don't care about that, you can use a stack of indent levels rather than a single counter. – Chris Dodd Dec 30 '16 at 19:00
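The lexer in this answer gets several OUTDENT tokens out of a scanner that returns one token per call by unput-ting fake lines. Another approach that is sometimes used is a thin wrapper that remembers how many OUTDENTs are still owed and hands them out before the next real token. The sketch below is an assumption-laden illustration in plain C: the token codes, `pending_outdents`, `held_token`, `next_token` and the scripted `demo_scan` (a stand-in for `yylex`) are all hypothetical, and how the parser is pointed at the wrapper depends on the bison setup.

```c
#include <assert.h>

/* Hypothetical token codes; a real project would take them from the
 * header that bison generates. */
enum { TOK_EOF = 0, TOK_OUTDENT = 1, TOK_NAME = 2 };

int pending_outdents = 0;  /* OUTDENTs still owed to the parser */
int held_token = -1;       /* real token waiting behind them, or -1 */

/* Stand-in for yylex(): returns NAME, NAME, then end of input.
 * Along with the second NAME it reports that two blocks were closed. */
int demo_scan(void)
{
    static int calls = 0;

    switch (calls++) {
    case 0:  return TOK_NAME;
    case 1:  pending_outdents = 2; return TOK_NAME;
    default: return TOK_EOF;
    }
}

/* What the parser would call instead of the raw scanner. */
int next_token(int (*scan)(void))
{
    int token;

    assert(pending_outdents >= 0);

    if (pending_outdents == 0 && held_token >= 0) {
        token = held_token;           /* all OUTDENTs delivered */
        held_token = -1;
        return token;
    }
    if (pending_outdents == 0) {
        token = scan();               /* may set pending_outdents */
        if (pending_outdents == 0)
            return token;
        held_token = token;           /* deliver it after the OUTDENTs */
    }
    --pending_outdents;
    return TOK_OUTDENT;
}
```

The trade-off is a little global state in exchange for leaving the input buffer untouched, so line and column tracking need no correction for fake characters.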
1

Curly brackets (and such) are only simpler if you use a tokenizer that strips out all whitespace (using it just to separate tokens). See this page (the section "How does the compiler parse the indentation?") for some ideas on Python tokenizing.

If you are not tokenizing before parsing, then there may be additional work to do; it depends on how you are building the parser.

Kathy Van Stone
-1

You need a rule that looks analogous to this (supposing you use tabs for your indents):

\t    { return TABDENT; }

Frankly, I've always found braces (or begin/end) to be easier to write and to read, both as a human and as a lexer/parser writer.

Paul Nathan
  • Well, seems like it's at least easier to write a lexer with special block-begin and block-end symbols. It's **not** easier to write { and } on my localized keyboard ;D – Frank Sep 12 '09 at 00:38