I'm parsing a language that has a statement 'code' followed by '{', followed by a bunch of code that I have no interest in parsing, followed by '}'. I'd ideally like to have a rule like:

skip_code: 'code' '{' ~['}']* '}'

..which would simply skip ahead to the closing curly brace. The problem is that the code being skipped could itself have pairs of curly braces. So, what I essentially need to do is run a counter and increment on each '{' and decrement on each '}', and end the parse rule when the counter is back to 0.

What's the best way of doing this in ANTLR4? Should I skip off to a custom function when 'code' is detected and swallow up the tokens and run my counter, or is there some elegant way to express this in the grammar itself?

EDIT: Some sample code, as requested:

class foo;
  int m_bar;
  function foo_bar;
     print("hello world");
  endfunction
  code {
     // This is some C code
     void my_c_func() {
        printf("I have curly braces {} in a string!");
     }
  }
  function back_to_parsed_code;
  endfunction
endclass
Stan
  • Could you post a real example of the code you're parsing? – Bart Kiers Dec 28 '16 at 07:45
  • Are there string literals (that might include a `{` or `}`) inside the code block you want to ignore? Are there comments inside those code blocks (that might include a `{` or `}`)? You could go for Mike's suggestion, but discarding these code blocks during lexing might be easier. Discarding them in the parser would mean that everything inside the `{ ... }` will still need to be tokenized. – Bart Kiers Dec 28 '16 at 12:03
  • @BartKiers Yeah, the content within the curly braces could be considered fully legal C code, with its own strings, curly braces, and so on, which makes it tricky to do in the lexer. Ideally, I don't want to even tokenize that code, but Mike's suggestion does make it very easy to implement in the parser. Any suggestions on how that could be done in the lexer? – Stan Dec 28 '16 at 16:46
  • Have you tried my block rule as a lexer rule? Should still work. The only restriction in both cases is that the curly braces must be balanced. – Mike Lischke Dec 28 '16 at 17:18
  • @Mike, it won't work if the code block contains strings or comments that themselves contain braces. – Bart Kiers Dec 28 '16 at 17:31
  • @Stan I'll post a small demo later on, if someone else doesn't before that time. – Bart Kiers Dec 28 '16 at 17:32

3 Answers

I'd use something like:

skip_code: CODE_SYM block;
block: OPEN_CURLY (~CLOSE_CURLY | block)* CLOSE_CURLY;

CODE_SYM: 'code';
OPEN_CURLY: '{';
CLOSE_CURLY: '}';
Mike Lischke

I'd handle these code blocks in the lexer. A quick demo (DemoLexer is the lexer class that ANTLR generates from the lexer grammar):

import org.antlr.v4.runtime.ANTLRInputStream;
import org.antlr.v4.runtime.Token;

public class Main {

    public static void main(String[] args) {

        String source = "class foo;\n" +
                "  int m_bar;\n" +
                "  function foo_bar;\n" +
                "     print(\"hello world\");\n" +
                "  endfunction\n" +
                "  code {\n" +
                "     // This is some C code }}} \n" +
                "     void my_c_func() {\n" +
                "        printf(\"I have curly braces {} in a string!\");\n" +
                "     }\n" +
                "  }\n" +
                "  function back_to_parsed_code;\n" +
                "  endfunction\n" +
                "endclass";

        System.out.printf("Tokenizing:\n\n%s\n\n", source);

        DemoLexer lexer = new DemoLexer(new ANTLRInputStream(source));

        for (Token t : lexer.getAllTokens()){
            System.out.printf("%-20s '%s'\n",
                    DemoLexer.VOCABULARY.getSymbolicName(t.getType()),
                    t.getText().replaceAll("[\r\n]", "\\\\n")
            );
        }
    }
}

If you run the class above, the following will be printed:

Tokenizing:

class foo;
  int m_bar;
  function foo_bar;
     print("hello world");
  endfunction
  code {
     // This is some C code }}} 
     void my_c_func() {
        printf("I have curly braces {} in a string!");
     }
  }
  function back_to_parsed_code;
  endfunction
endclass

ID                   'class'
ID                   'foo'
ANY                  ';'
ID                   'int'
ID                   'm_bar'
ANY                  ';'
ID                   'function'
ID                   'foo_bar'
ANY                  ';'
ID                   'print'
ANY                  '('
STRING               '"hello world"'
ANY                  ')'
ANY                  ';'
ID                   'endfunction'
ID                   'code'
BLOCK                '{\n     // This is some C code }}} \n     void my_c_func() {\n        printf("I have curly braces {} in a string!");\n     }\n  }'
ID                   'function'
ID                   'back_to_parsed_code'
ANY                  ';'
ID                   'endfunction'
ID                   'endclass'
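
The scanning logic that a string- and comment-aware BLOCK token has to perform can also be sketched by hand. The following is only an illustration in plain Java, not the DemoLexer grammar: the class name BlockScanner and the method findBlockEnd are hypothetical, and it assumes the embedded code uses C-style string and character literals plus `//` and `/* */` comments.

```java
public class BlockScanner {

    /**
     * Returns the index just past the '}' that closes the '{' at index open,
     * or -1 if the block is never closed. Braces inside string literals,
     * character literals, and comments are ignored (assumed C-style syntax).
     */
    static int findBlockEnd(String src, int open) {
        if (open < 0 || open >= src.length() || src.charAt(open) != '{') {
            throw new IllegalArgumentException("expected '{' at index " + open);
        }
        int depth = 0;
        int i = open;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (src.startsWith("//", i)) {
                // Line comment: skip to the character after the newline.
                int nl = src.indexOf('\n', i);
                if (nl < 0) {
                    return -1;
                }
                i = nl + 1;
            } else if (src.startsWith("/*", i)) {
                // Block comment: skip past the closing "*/".
                int end = src.indexOf("*/", i + 2);
                if (end < 0) {
                    return -1;
                }
                i = end + 2;
            } else if (c == '"' || c == '\'') {
                // String or char literal: skip to the matching unescaped quote.
                i++;
                while (i < src.length() && src.charAt(i) != c) {
                    if (src.charAt(i) == '\\') {
                        i++; // step over the escaped character
                    }
                    i++;
                }
                i++; // step past the closing quote
            } else {
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    return i + 1;
                }
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String source = "code {\n"
                + "  // stray braces }}}\n"
                + "  void f() { printf(\"{} in a string\"); }\n"
                + "}\n"
                + "function back_to_parsed_code;";
        int end = findBlockEnd(source, source.indexOf('{'));
        // Prints: function back_to_parsed_code;
        System.out.println(source.substring(end).trim());
    }
}
```

The counter only changes on braces that appear outside literals and comments, which is why the `}}}` in the comment and the `{}` in the string do not end the block early.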
Bart Kiers

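You can use lexer modes for this. Note that the CODE section needs two modes: CODE_0 handles the outermost level, where '}' is emitted as RCURLY, and CODE_N handles nested levels, where braces are retyped to OTHER. With only one mode you cannot properly close the CODE section.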

Lexer

lexer grammar Question_41355044Lexer;

CODE: 'code';
LCURLY: '{' -> pushMode(CODE_0);
WS:    [ \t\r\n] -> skip;

mode CODE_0;

CODE_0_LCURLY: '{' -> type(OTHER), pushMode(CODE_N);
RCURLY: '}' -> popMode;     // Close for LCURLY
CODE_0_OTHER: ~[{}]+ -> type(OTHER);

mode CODE_N;

CODE_N_LCURLY: '{' -> type(OTHER), pushMode(CODE_N);
CODE_N_RCURLY: '}' -> type(OTHER), popMode;
OTHER: ~[{}]+;

Parser

parser grammar Question_41355044Parser;

options { tokenVocab = Question_41355044Lexer; }

skip_code: 'code' LCURLY OTHER* RCURLY;

Input

code {
   // This is some C code
   void my_c_func() {
      printf("I have curly braces {} in a string!");
   }
}

Output tokens

CODE LCURLY({) OTHER(   // Th...) OTHER({) OTHER(      pr...) 
OTHER({) OTHER(}) OTHER( in a st...) OTHER(}) OTHER() RCURLY(}) EOF

The same approach is used in the grammar for ANTLR itself: https://github.com/antlr/grammars-v4/tree/master/antlr4

There, however, runtime code (LexerAdaptor.py) is used instead of two-level modes.
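
To see why the outermost level needs its own mode, here is a small plain-Java simulation of the mode stack used by the lexer grammar above. It is a sketch only: the class ModeStackDemo and the method tokenTypes are hypothetical names, and it imitates pushMode/popMode with a Deque rather than using the ANTLR runtime.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ModeStackDemo {

    enum Mode { DEFAULT, CODE_0, CODE_N }

    /** Returns the token type names the two-mode lexer would emit for src. */
    static List<String> tokenTypes(String src) {
        List<String> types = new ArrayList<>();
        Deque<Mode> stack = new ArrayDeque<>();
        Mode mode = Mode.DEFAULT;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (mode == Mode.DEFAULT) {
                if (" \t\r\n".indexOf(c) >= 0) {
                    i++;                          // WS -> skip
                } else if (src.startsWith("code", i)) {
                    types.add("CODE");
                    i += 4;
                } else if (c == '{') {
                    types.add("LCURLY");          // pushMode(CODE_0)
                    stack.push(mode);
                    mode = Mode.CODE_0;
                    i++;
                } else {
                    throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
                }
            } else if (c == '{') {
                types.add("OTHER");               // nested '{': pushMode(CODE_N)
                stack.push(mode);
                mode = Mode.CODE_N;
                i++;
            } else if (c == '}') {
                // Only the '}' seen in CODE_0 closes the section.
                types.add(mode == Mode.CODE_0 ? "RCURLY" : "OTHER");
                mode = stack.pop();               // popMode
                i++;
            } else {
                // ~[{}]+ : consume a run of non-brace characters as one OTHER.
                while (i < src.length() && src.charAt(i) != '{' && src.charAt(i) != '}') {
                    i++;
                }
                types.add("OTHER");
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Prints: [CODE, LCURLY, OTHER, OTHER, OTHER, OTHER, OTHER, RCURLY]
        System.out.println(tokenTypes("code { a { b } c }"));
    }
}
```

Like the grammar, this simulation counts every brace, including those inside strings and comments, so it relies on the braces in the skipped code being balanced.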

Ivan Kochurkin