
I'm trying to learn ANTLR4 and I'm already having some issues with my first experiment.

The goal here is to learn how to use ANTLR to syntax highlight a QScintilla component. To practice a little bit I've decided I'd like to learn how to properly highlight *.ini files.

First things first, in order to run the MCVE you'll need to:

  • Download ANTLR4 and make sure it works; read the instructions on the main site
  • Install the Python ANTLR runtime: pip install antlr4-python3-runtime
  • Generate the lexer/parser from ini.g4:

    grammar ini;
    
    start : section (option)*;
    section : '[' STRING ']';
    option : STRING '=' STRING;
    
    COMMENT : ';'  ~[\r\n]*;
    STRING  : [a-zA-Z0-9]+;
    WS      : [ \t\n\r]+;
    

by running antlr ini.g4 -Dlanguage=Python3 -o ini

  • Finally, save main.py:

    import textwrap
    
    from PyQt5.Qt import *
    from PyQt5.Qsci import QsciScintilla, QsciLexerCustom
    
    from antlr4 import *
    from ini.iniLexer import iniLexer
    from ini.iniParser import iniParser
    
    
    class QsciIniLexer(QsciLexerCustom):
    
        def __init__(self, parent=None):
            super().__init__(parent=parent)
    
            lst = [
                {'bold': False, 'foreground': '#f92472', 'italic': False},  # 0 - deeppink
                {'bold': False, 'foreground': '#e7db74', 'italic': False},  # 1 - khaki (yellowish)
                {'bold': False, 'foreground': '#74705d', 'italic': False},  # 2 - dimgray
                {'bold': False, 'foreground': '#f8f8f2', 'italic': False},  # 3 - whitesmoke
            ]
            style = {
                "T__0": lst[3],
                "T__1": lst[3],
                "T__2": lst[3],
                "COMMENT": lst[2],
                "STRING": lst[0],
                "WS": lst[3],
            }
    
            for token in iniLexer.ruleNames:
                token_style = style[token]
    
                foreground = token_style.get("foreground", None)
                background = token_style.get("background", None)
                bold = token_style.get("bold", None)
                italic = token_style.get("italic", None)
                underline = token_style.get("underline", None)
                index = getattr(iniLexer, token)
    
                if foreground:
                    self.setColor(QColor(foreground), index)
                if background:
                    self.setPaper(QColor(background), index)
    
        def defaultPaper(self, style):
            return QColor("#272822")
    
        def language(self):
            # the generated lexer carries the name of its grammar file
            return iniLexer.grammarFileName
    
        def styleText(self, start, end):
            view = self.editor()
            code = view.text()
            lexer = iniLexer(InputStream(code))
            stream = CommonTokenStream(lexer)
            parser = iniParser(stream)
    
            tree = parser.start()
            print('parsing'.center(80, '-'))
            print(tree.toStringTree(recog=parser))
    
            lexer.reset()
            self.startStyling(0)
            print('lexing'.center(80, '-'))
            while True:
                t = lexer.nextToken()
                if t.type == Token.EOF:
                    # stop before printing/styling the EOF token
                    break
                print(lexer.ruleNames[t.type - 1], repr(t.text))
                self.setStyling(len(t.text), t.type)
    
        def description(self, style_nr):
            return str(style_nr)
    
    
    if __name__ == '__main__':
        app = QApplication([])
        v = QsciScintilla()
        lexer = QsciIniLexer(v)
        v.setLexer(lexer)
        v.setText(textwrap.dedent("""\
            ; Comment outside
    
            [section s1]
            ; Comment inside
            a = 1
            b = 2
    
            [section s2]
            c = 3 ; Comment right side
            d = e
        """))
        v.show()
        app.exec_()
    

and run it. If everything went well you should get this outcome:

[screenshot: showcase of the demo output]

Here are my questions:

  • As you can see, the outcome of the demo is far from usable; you definitely don't want that, it's really distracting. Instead, you'd like behaviour similar to what all the IDEs out there provide. Unfortunately I don't know how to achieve that, so how would you modify the snippet to provide such behaviour?
  • Right now I'm trying to mimic highlighting similar to the snapshot below:

[screenshot: target highlighting]

You can see in that screenshot that the highlighting differs within variable assignments (variables = deeppink and values = yellowish), but I don't know how to achieve that. I've tried using this slightly modified grammar:

grammar ini;

start : section (option)*;
section : '[' STRING ']';
option : VARIABLE '=' VALUE;

COMMENT : ';'  ~[\r\n]*;
VARIABLE  : [a-zA-Z0-9]+;
VALUE  : [a-zA-Z0-9]+;
WS      : [ \t\n\r]+;

and then changing the styles to:

style = {
    "T__0": lst[3],
    "T__1": lst[3],
    "T__2": lst[3],
    "COMMENT": lst[2],
    "VARIABLE": lst[0],
    "VALUE": lst[1],
    "WS": lst[3],
}

but if you look at the lexing output you'll see there is no distinction between VARIABLE and VALUE because of rule precedence in the ANTLR grammar: VALUE can never match, since VARIABLE is defined first and matches the same input. So my question is: how would you modify the grammar/snippet to achieve such a visual appearance?

BPL

3 Answers


The problem is that the lexer needs to be context sensitive: everything on the left hand side of the = needs to be a variable, and to the right of it a value. You can do this by using ANTLR's lexical modes. You start off by classifying successive non-spaces as being a variable, and when encountering a =, you move into your value-mode. When inside the value-mode, you pop out of this mode whenever you encounter a line break.

Note that lexical modes only work in a lexer grammar, not the combined grammar you now have. Also, for syntax highlighting, you probably only need the lexer.

Here's a quick demo of how this could work (stick it in a file called IniLexer.g4):

lexer grammar IniLexer;

SECTION
 : '[' ~[\]]+ ']'
 ;

COMMENT
 : ';' ~[\r\n]*
 ;

ASSIGN
 : '=' -> pushMode(VALUE_MODE)
 ;

KEY
 : ~[ \t\r\n]+
 ;

SPACES
 : [ \t\r\n]+ -> skip
 ;

UNRECOGNIZED
 : .
 ;

mode VALUE_MODE;

  VALUE_MODE_SPACES
   : [ \t]+ -> skip
   ;

  VALUE
   : ~[ \t\r\n]+
   ;

  VALUE_MODE_COMMENT
   : ';' ~[\r\n]* -> type(COMMENT)
   ;

  VALUE_MODE_NL
   : [\r\n]+ -> skip, popMode
   ;

If you now run the following script:

from antlr4 import InputStream, CommonTokenStream

# import the lexer generated from IniLexer.g4 (adjust the path if you used -o)
from IniLexer import IniLexer

source = """
; Comment outside

[section s1]
; Comment inside
a = 1
b = 2

[section s2]
c = 3 ; Comment right side
d = e
"""

lexer = IniLexer(InputStream(source))
stream = CommonTokenStream(lexer)
stream.fill()

for token in stream.tokens[:-1]:
    print("{0:<25} '{1}'".format(IniLexer.symbolicNames[token.type], token.text))

you will see the following output:

COMMENT                   '; Comment outside'
SECTION                   '[section s1]'
COMMENT                   '; Comment inside'
KEY                       'a'
ASSIGN                    '='
VALUE                     '1'
KEY                       'b'
ASSIGN                    '='
VALUE                     '2'
SECTION                   '[section s2]'
KEY                       'c'
ASSIGN                    '='
VALUE                     '3'
COMMENT                   '; Comment right side'
KEY                       'd'
ASSIGN                    '='
VALUE                     'e'

And an accompanying parser grammar could look like this:

parser grammar IniParser;

options {
  tokenVocab=IniLexer;
}

sections
 : section* EOF
 ;

section
 : COMMENT
 | SECTION section_atom*
 ;

section_atom
 : COMMENT
 | KEY ASSIGN VALUE
 ;

which would parse your example input into the following parse tree:

[screenshot: parse tree]
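
If you want to plug this lexer back into the QsciLexerCustom from your question, a rough sketch of how that could look is below. The ini.IniLexer import path, the STYLES mapping and the colours are assumptions on my part, and it naively restyles the whole document on every styleText call, which only lines up with Scintilla's byte-based positions for ASCII input:

# Rough sketch: drive QsciLexerCustom with the IniLexer above.
# Assumptions: the generated lexer lives in ini/IniLexer.py and the input
# is ASCII (ANTLR counts characters, Scintilla counts bytes).
from antlr4 import InputStream
from PyQt5.QtGui import QColor
from PyQt5.Qsci import QsciLexerCustom

from ini.IniLexer import IniLexer

# style index per symbolic token name; 0 is the default style
STYLES = {"SECTION": 1, "COMMENT": 2, "KEY": 3, "ASSIGN": 0, "VALUE": 4}


class QsciIniLexer(QsciLexerCustom):

    def __init__(self, parent=None):
        super().__init__(parent=parent)
        self.setDefaultPaper(QColor("#272822"))
        self.setColor(QColor("#f8f8f2"), 0)  # default
        self.setColor(QColor("#e7db74"), 1)  # SECTION
        self.setColor(QColor("#74705d"), 2)  # COMMENT
        self.setColor(QColor("#f92472"), 3)  # KEY
        self.setColor(QColor("#ae81ff"), 4)  # VALUE

    def language(self):
        return "ini"

    def description(self, style_nr):
        return "style_%d" % style_nr

    def styleText(self, start, end):
        # restyle the whole document for simplicity
        text = self.editor().text()
        lexer = IniLexer(InputStream(text))
        self.startStyling(0)
        pos = 0
        for token in lexer.getAllTokens():
            # skipped rules leave gaps; paint them with the default style
            if token.start > pos:
                self.setStyling(token.start - pos, 0)
            name = IniLexer.symbolicNames[token.type]
            self.setStyling(token.stop - token.start + 1, STYLES.get(name, 0))
            pos = token.stop + 1

Since SPACES and VALUE_MODE_NL are skipped, the lexer never emits tokens for them, which is why the sketch styles the gaps between tokens explicitly instead of relying on consecutive setStyling calls.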

Bart Kiers
  • Really cool answer with new topics I didn't know, thanks, I'll test it out. Btw, you suggest using a lexer instead of a parser but eventually I'd like to achieve something like [this](https://www.youtube.com/watch?v=Jes3bD6P0To), check from 6:20 to 12:30 in my real case (which is a GLSL IDE). In any case, what about the other part of my question? How do you deal with errors so the highlighting doesn't become screwed up? +1 in the meantime – BPL Jun 09 '19 at 09:05
  • "but eventually I'd like to achieve [...]", OK, then just a lexer isn't going to cut it, and yes, you do need a parser. About the second part of your question, I can't give a meaningful answer to that: I've never used ANTLR in such a way (incremental parsing for IDE plugins/tools) – Bart Kiers Jun 09 '19 at 12:06
  • Interesting talk, btw. – Bart Kiers Jun 09 '19 at 12:09
  • Indeed, really nice talk! I must say it was actually quite difficult for me to pick between antlr4 and tree-sitter tbh, both tools are pretty awesome. Anyway, I think your answer pretty much satisfies my current question; I've already checked it out and it works fine. Now it's time for me to adjust my trivial hello world snippet to use a parser instead of a lexer, I'll do that before trying to use a more complex grammar like GLSL. Plus... not sure how difficult applying these lexical modes to complex grammars like GLSL would be, time to check ;) – BPL Jun 09 '19 at 12:17

I already implemented something like this in C++.

https://github.com/tora-tool/tora/blob/master/src/editor/tosqltext.cpp

I sub-classed the QScintilla class and implemented a custom lexer based on the ANTLR-generated source.

You might even use the ANTLR parser (I did not use it); QScintilla allows you to have more than one analyzer (with different weights), so you can periodically perform some semantic check on the text. What cannot be done easily in QScintilla is to associate a token with some additional data.

ibre5041
  • Wow, so you've had this idea as well, awesome, I'll take a look... About using the ANTLR parser, I'm not sure about the C++ ANTLR runtime; it's probably much faster than the Python one. Thing is, yesterday I tried to parse 28kb of commented GLSL code with a GLSL ANTLR parser and it took 1.9s! That's just crazy and you definitely can't use it in realtime (parsing on each keystroke)... where the parsing time should be ~50-100ms – BPL Jun 10 '19 at 14:00
  • I use the C++ runtime for ANTLR3; the parsing runs in a background thread and QScintilla usually sends just one line of text to be parsed. So I had to implement some hacks for multi-line comments. – ibre5041 Jun 10 '19 at 15:54

Syntax highlighting in Scintilla is done by dedicated highlighter classes, which are lexers. A parser is not well suited for this kind of work, because the syntax highlighting feature must keep working even when the input contains errors. A parser is a tool to verify the correctness of the input - two totally different tasks.

So I recommend you stop thinking about using ANTLR4 for that and just take one of the existing lexer classes as a starting point to create a new one for the language you want to highlight.
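
For instance, QScintilla already ships a stock Properties lexer that handles key/value files like *.ini out of the box; a minimal sketch of using it could look like this:

# Minimal sketch: use QScintilla's stock Properties lexer instead of a
# hand-rolled one; it handles key/value files such as *.ini out of the box.
from PyQt5.QtWidgets import QApplication
from PyQt5.Qsci import QsciScintilla, QsciLexerProperties

app = QApplication([])
view = QsciScintilla()
view.setLexer(QsciLexerProperties(view))
view.setText("[section s1]\n; Comment inside\na = 1\nb = 2\n")
view.show()
app.exec_()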

Mike Lischke
  • I've been using ANTLR4 for 2 days and I think it's the right tool for the job here... And I'm telling you this after having used QScintilla with {builtin Scintilla lexers, pygments, syntect, pyparsing, lark}. So it's not like I'm picking ANTLR4 out of the blue... Actually I was considering either ANTLR4 or tree-sitter, but I picked the former mainly because of the large number of existing available grammars. You say "for the language" you want to highlight... well, in the real case I'm coding a few IDEs, one of them a GLSL IDE but the other one a multi-language text editor, so... – BPL Jun 09 '19 at 09:09
  • Also, I can see in this other [question](https://stackoverflow.com/a/44621880/3809375) you also recommended using a lexer instead of a parser and the guy decided to go with a parser. Well, to me the most important thing will be performance, so first I need to check how long it takes to parse GLSL files of ~30kb... Probably my decision will be based on those measurements, as parsing per keystroke shouldn't take longer than ~100ms – BPL Jun 09 '19 at 09:22