How to return multiple tokens with Jison lexer

Question

I'm new to lexing and parsing so sorry if the title isn't clear enough.

Basically, I'm using Jison to parse some text and I am trying to get the lexer to comprehend indentation. Here's the bit in question:

(\r\n|\r|\n)+\s*      %{
                        parser.indentCount = parser.indentCount || [0];

                        var indentation = yytext.replace(/^(\r\n|\r|\n)+/, '').length;

                        if (indentation > parser.indentCount[0]) {
                           parser.indentCount.unshift(indentation);
                           return 'INDENT';
                        }

                        var tokens = [];

                        while (indentation < parser.indentCount[0]) {
                          tokens.push('DEDENT');
                          parser.indentCount.shift();
                        }

                        if (tokens.length) {
                           return tokens;
                        }

                        if (!indentation.length) {
                          return 'NEWLINE';
                        }
                      %}

So far, almost all of that works as expected. The one problem is the line where I attempt to return an array of DEDENT tokens. It appears that Jison just converts that array into a string, which gives me a parse error like Expecting ........, got DEDENT,DEDENT.
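The DEDENT,DEDENT in that message is consistent with the returned array being coerced to a string somewhere in the generated code. Where exactly Jison does this is not confirmed here, but plain JavaScript coercion, with no Jison involved, produces the same text:

```javascript
// Plain JavaScript, no Jison involved: coercing an array to a string
// joins its elements with commas.
var tokens = ['DEDENT', 'DEDENT'];
var coerced = String(tokens);
console.log(coerced);
```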

What I'm hoping I can do to get around this is manually push some DEDENT tokens onto the stack. Maybe with a function like this.pushToken('DEDENT') or something along those lines. But the Jison documentation is not so great and I could use some help.

Any thoughts?

EDIT:

I seem to have been able to hack my way around this after looking at the generated parser code. Here's what seems to work...

if (tokens.length) {
  var args = arguments;

  tokens.slice(1).forEach(function () {
    lexer.performAction.apply(this, args);
  }.bind(this));

  return 'DEDENT';
}

This tricks the lexer into performing another action using the exact same input for each DEDENT we have in the stack, thus allowing it to add in the proper dedents. However, it feels gross and I'm worried there could be unforeseen problems.

I would still love it if anyone had any ideas on a better way to do this.

score 1 · Answer 1 · answered Oct 23 '16 at 16:20

After a couple of days I ended up figuring out a better answer. Here's what it looks like:

(\r\n|\r|\n)+[ \t]*   %{
                        parser.indentCount = parser.indentCount || [0];
                        parser.forceDedent = parser.forceDedent || 0;

                        if (parser.forceDedent) {
                          parser.forceDedent -= 1;
                          this.unput(yytext);
                          return 'DEDENT';
                        }

                        var indentation = yytext.replace(/^(\r\n|\r|\n)+/, '').length;

                        if (indentation > parser.indentCount[0]) {
                           parser.indentCount.unshift(indentation);
                           return 'INDENT';
                        }

                        var dedents = [];

                        while (indentation < parser.indentCount[0]) {
                          dedents.push('DEDENT');
                          parser.indentCount.shift();
                        }

                        if (dedents.length) {
                           parser.forceDedent = dedents.length - 1;
                           this.unput(yytext);
                           return 'DEDENT';
                        }

                        return 'NEWLINE';
                      %}

Firstly, I changed the trailing \s* in my capture regex to [ \t]*. Because \s also matches newline characters, the old pattern could inadvertently capture extra newlines after a series of non-newline spaces.
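The difference between the two patterns can be checked outside of Jison with a minimal sketch in plain JavaScript. The sample input and the ^ anchors (used to imitate matching at the current lexer position) are assumptions made for the demonstration:

```javascript
// Sample input: a newline, two spaces, another newline, four spaces, then text.
var input = '\n  \n    x';

// Old pattern: \s* also consumes the second newline and the spaces after it.
var oldMatch = input.match(/^(\r\n|\r|\n)+\s*/)[0];

// New pattern: [ \t]* stops at the second newline.
var newMatch = input.match(/^(\r\n|\r|\n)+[ \t]*/)[0];

console.log(JSON.stringify(oldMatch));
console.log(JSON.stringify(newMatch));
```

With the old pattern, stripping the leading newline from the match leaves seven characters, so the measured indentation would be 7 instead of 2.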

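Next, we make sure there are 2 "global" variables. indentCount is a stack of the indentation lengths of the blocks that are currently open, with the most recent one at index 0. forceDedent counts how many DEDENT tokens we still have to return; as long as it has a value above 0, the rule returns a DEDENT.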

Next, we have a condition to test for a truthy value on forceDedent. If we have one, we'll decrement it by 1 and use the unput function to make sure we iterate on this same pattern at least one more time, but for this iteration, we'll return a DEDENT.

If we haven't returned, we get the length of our current indentation.

If the current indentation is greater than our most recent indentation, we'll track that on our indentCount variable and return an INDENT.

If we haven't returned, it's time to prepare for possible dedents. We'll make an array to track them.

When we detect a dedent, the user could be attempting to close 1 or more blocks all at once. So we need to include a DEDENT for as many blocks as the user is closing. We set up a loop and say that for as long as the current indentation is less than our most recent indentation, we'll add a DEDENT to our list and shift an item off of our indentCount.
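As an isolated illustration of that loop, here is a small plain-JavaScript sketch that runs outside of Jison. The function name countDedents is made up for this example; the stack plays the role of parser.indentCount, with the most recent indentation at index 0.

```javascript
// Hypothetical helper, not part of Jison: pops every indentation level
// deeper than `indentation` and reports how many blocks were closed.
// `indentStack` mirrors parser.indentCount (most recent level first).
function countDedents(indentStack, indentation) {
  var closed = 0;
  while (indentation < indentStack[0]) {
    indentStack.shift();
    closed += 1;
  }
  return closed;
}

// Two nested blocks are open (at 4 and 8 spaces); the next line has no indentation.
var stack = [8, 4, 0];
var closed = countDedents(stack, 0);
console.log(closed, stack);
```

Here both open blocks are closed at once, so the lexer owes the parser two DEDENT tokens for a single match.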

If we tracked any dedents, we need to make sure all of them get returned by the lexer. Because the lexer can only return 1 token at a time, we'll return 1 here, but we'll also set our forceDedent variable to make sure we return the rest of them as well. To make sure we iterate on this pattern again and those dedents can be inserted, we'll use the unput function.

In any other case, we'll just return a NEWLINE.
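To watch the whole rule in motion without generating a parser, the following sketch replays the same logic against a tiny stand-in lexer. The function name lexIndentation, the mock unput, and the sample input are assumptions made for this illustration and are not part of Jison; the real lexer re-matches the rule against its input buffer rather than an array of matches.

```javascript
// Stand-in harness, not Jison's implementation: `matches` is the list of
// newline-plus-indentation strings the rule would have matched, in order.
function lexIndentation(matches) {
  var parser = {};               // holds indentCount and forceDedent, as in the rule
  var pending = matches.slice(); // matches still waiting to be lexed
  var out = [];

  var lexer = {
    // Mock of unput: puts the text back so the same rule sees it again.
    unput: function (text) { pending.unshift(text); }
  };

  // Same logic as the rule's action, with a counter instead of an array.
  function action(yytext) {
    parser.indentCount = parser.indentCount || [0];
    parser.forceDedent = parser.forceDedent || 0;

    if (parser.forceDedent) {
      parser.forceDedent -= 1;
      this.unput(yytext);
      return 'DEDENT';
    }

    var indentation = yytext.replace(/^(\r\n|\r|\n)+/, '').length;

    if (indentation > parser.indentCount[0]) {
      parser.indentCount.unshift(indentation);
      return 'INDENT';
    }

    var dedents = 0;
    while (indentation < parser.indentCount[0]) {
      dedents += 1;
      parser.indentCount.shift();
    }

    if (dedents) {
      parser.forceDedent = dedents - 1;
      this.unput(yytext);
      return 'DEDENT';
    }

    return 'NEWLINE';
  }

  while (pending.length) {
    out.push(action.call(lexer, pending.shift()));
  }
  return out;
}

// Indent to 2 spaces, indent to 4 spaces, then drop back to column 0.
var result = lexIndentation(['\n  ', '\n    ', '\n']);
console.log(result.join(' '));
```

With this sample input, the final newline closes both open blocks, so it is replayed until two DEDENT tokens and then a NEWLINE have been returned.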
