
I am trying to write a parser for Juniper SRX router access control lists. Below is the grammar I am using:

grammar SRXBackend;

acl:
    'security' '{' 'policies' '{' COMMENT* replaceStmt '{' policy* '}' '}' '}'
            applications
            addressBook
;

replaceStmt:
    'replace:' IDENT
|   'replace:' 'from-zone' IDENT 'to-zone' IDENT
;

policy:
    'policy' IDENT '{' 'match' '{' fromStmt* '}' 'then' (action | '{' action+ '}') '}'
;

fromStmt:
     'source-address' addrBlock                     # sourceAddrStmt
|    'destination-address' addrBlock                # destinationAddrStmt
|    'application' (srxName ';' | '[' srxName+ ']')  # applicationBlock
;

action:
    'permit' ';'
|   'deny' ';'
|   'log { session-close; }'
;

addrBlock:
    '[' srxName+ ']'
|   srxName ';'
;

applications:
    'applications' '{' application* '}'
|   'applications' '{' 'apply-groups' IDENT ';' '}' 'groups' '{' replaceStmt  '{' 'applications' '{' application* '}' '}' '}'
;

addressBook:
    'security' '{' 'address-book' '{' replaceStmt '{' addrEntry* '}' '}' '}'
|   'groups' '{' replaceStmt  '{' 'security' '{' 'address-book' '{' IDENT '{' addrEntry* '}' '}' '}' '}' '}' 'security' '{' 'apply-groups' IDENT ';' '}'
;

application:
    'replace:'? 'application' srxName '{' applicationStmt+ '}'
;

applicationStmt:
    'protocol' srxName ';'            #applicationProtocol
|   'source-port' portRange ';'       #applicationSrcPort
|   'destination-port' portRange ';'  #applicationDstPort
;

portRange:
    NUMBER             #portRangeOne
|   NUMBER '-' NUMBER  #portRangeMinMax
;

addrEntry:
    'address-set' IDENT '{' addrEntryStmt+ '}' #addrEntrySet
|   'address' srxName cidr ';'                 #addrEntrySingle
;

addrEntryStmt:
    ('address-set' | 'address') srxName ';'
;

cidr:
    NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ('/' NUMBER)?
;

srxName:
    NUMBER
|   IDENT
|   cidr
;

COMMENT : '/*' .*? '*/' ;
NUMBER  : [0-9]+ ;
IDENT   : [a-zA-Z][a-zA-Z0-9,\-_:\./]* ;
WS      : [ \t\n]+ -> skip ;

When I try to use an ACL with ~80,000 lines, it takes up to ~10 minutes to generate the parse tree. I am using the following code for creating the parse tree:

from antlr4 import *
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser
import sys


def main(argv):
    ipt = FileStream(argv[1])
    lexer = SRXBackendLexer(ipt)
    stream = CommonTokenStream(lexer)
    parser = SRXBackendParser(stream)
    parser.acl()

if __name__ == '__main__':
    main(sys.argv)

I am using Python 2.7 as the target language. I also ran cProfile to identify which code takes the most time. Below are the first few records, sorted by time:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   608448   62.699    0.000  272.359    0.000 LexerATNSimulator.py:152(execATN)
  5007036   41.253    0.000   71.458    0.000 LexerATNSimulator.py:570(consume)
  5615722   32.048    0.000   70.416    0.000 DFAState.py:131(__eq__)
 11230968   24.709    0.000   24.709    0.000 InputStream.py:73(LA)
  5006814   21.881    0.000   31.058    0.000 LexerATNSimulator.py:486(captureSimState)
  5007274   20.497    0.000   29.349    0.000 ATNConfigSet.py:160(__eq__)
 10191162   18.313    0.000   18.313    0.000 {isinstance}
 10019610   16.588    0.000   16.588    0.000 {ord}
  5615484   13.331    0.000   13.331    0.000 LexerATNSimulator.py:221(getExistingTargetState)
  6832160   12.651    0.000   12.651    0.000 InputStream.py:52(index)
  5007036   10.593    0.000   10.593    0.000 InputStream.py:67(consume)
   449433    9.442    0.000  319.463    0.001 Lexer.py:125(nextToken)
        1    8.834    8.834   16.930   16.930 InputStream.py:47(_loadString)
   608448    8.220    0.000  285.163    0.000 LexerATNSimulator.py:108(match)
  1510237    6.841    0.000   10.895    0.000 CommonTokenStream.py:84(LT)
   449432    6.044    0.000  363.766    0.001 Parser.py:344(consume)
   449433    5.801    0.000    9.933    0.000 Token.py:105(__init__)

I cannot make much sense of it, except that InputStream.LA takes around half a minute. I guess this is because the entire input text gets buffered/loaded at once. Is there any alternative or lazier way of parsing or loading data for the Python target? Is there any improvement I can make to the grammar to make parsing faster?
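
For reference, a profile like the one above can be produced by wrapping the same main() in cProfile; the dump file name and row count below are only examples, not necessarily what was originally used:

import cProfile
import pstats

# Run the main() defined above under the profiler (assumes this lives in
# the same script), then print the hottest functions sorted by total time.
cProfile.run('main(sys.argv)', 'parse.prof')
pstats.Stats('parse.prof').sort_stats('tottime').print_stats(25)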

Thank you

prthrokz
  • It is not an answer, but have you tried using PyPy or anything else? Just to know how much of a burden falls on Python? – Divisadero Mar 11 '16 at 10:01
  • I haven't used PyPy but did some more research since yesterday. It seems that ANTLR's input stream class converts the entire text input into a byte buffer character by character. That seems to take over a minute. Is there a faster way to do this? I am sure I can override the input stream's implementation as long as I can find a better way of doing this. – prthrokz Mar 11 '16 at 17:38
  • @prthrokz, I'd advise you to try the "old" Antlr 3. Antlr 4 tries to parse just about every grammar but has to put in excessive runtime effort to parse even very simple grammars. Antlr 3 is more restrictive, but fast. – Kijewski Mar 14 '16 at 18:21
  • @Kay: I read somewhere that we can fall back to ANTLR3-style SLL(*) parsing by setting a flag on the parser. Any idea if the Python target supports that feature? (See the sketch after these comments.) – prthrokz Mar 14 '16 at 18:25
  • @prthrokz, sorry, I have no idea. – Kijewski Mar 15 '16 at 16:36
  • @prthrokz Are you using the 4.5.3 runtime? If not, try switching to it and see if it makes any difference. I read somewhere that previous release(s) had a bug resulting in slow parsing, and this was fixed in 4.5.3, released on March 30th 2016 on PyPI - https://pypi.python.org/pypi/antlr4-python2-runtime – shikhanshu Apr 05 '16 at 19:09
  • I had a big performance problem with an Ada parser using ANTLR. Running it with PyPy divided running times by 10 or more. I recommend that approach if you cannot fix it by another means. – Jean-François Fabre Jun 17 '16 at 11:02
  • Could you also post the Lexer definition? – Apalala Aug 11 '16 at 18:51
  • Not a plug... although I'm the main author. If the target language is Python, you could consider using [Grako](https://pypi.python.org/pypi/grako/). The example in *examples/antlr2grako* will translate the ANTLR grammar to Grako (e-ebnf) format (better if you include the lexical definitions in the same .g file). – Apalala Aug 11 '16 at 22:51
  • @Apalala, the tokens are at the bottom of the grammar. – aghast Feb 02 '17 at 03:40
  • Please try this change: `portRange: NUMBER ('-' NUMBER)?;` – Apalala Feb 02 '17 at 11:06
  • It is my understanding that your `IDENT` can be zero-sized due to `*` instead of `+`. This would send your parser into all sorts of funny loops for every single character... – user2722968 Apr 30 '17 at 19:07
  • Is your question still relevant? If so, do you have a sample ACL file? – Laurent LAPORTE May 12 '17 at 19:40
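
A minimal sketch of the SLL(*) fallback asked about in the comments above, assuming the antlr4 Python runtime exposes the prediction mode on the parser's ATN simulator (parser._interp.predictionMode) and the BailErrorStrategy/ParseCancellationException pair the same way the Java runtime does; untested against the Python 2 runtime used in the question:

from antlr4 import FileStream, CommonTokenStream
from antlr4.atn.PredictionMode import PredictionMode
from antlr4.error.ErrorStrategy import BailErrorStrategy, DefaultErrorStrategy
from antlr4.error.Errors import ParseCancellationException
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser


def parse_acl(path):
    stream = CommonTokenStream(SRXBackendLexer(FileStream(path)))
    parser = SRXBackendParser(stream)
    # Fast path: SLL prediction, bail out on the first parse error.
    parser._interp.predictionMode = PredictionMode.SLL
    parser._errHandler = BailErrorStrategy()
    try:
        return parser.acl()
    except ParseCancellationException:
        # Slow path: rewind and reparse with full LL prediction.
        stream.seek(0)
        parser.reset()
        parser._interp.predictionMode = PredictionMode.LL
        parser._errHandler = DefaultErrorStrategy()
        return parser.acl()

If the input parses cleanly under SLL, the expensive full-LL prediction never runs; the second pass is only a safety net for genuinely ambiguous input.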

2 Answers


It is my understanding that your IDENT can be zero-sized due to * instead of +. This sends your parser into loops for every single character, generating zero-sized IDENT nodes.

user2722968
  • His rule is `IDENT : [a-zA-Z][a-zA-Z0-9,\-_:\./]*`. It can't be zero-sized. It has to have a letter (either case) followed by 0 or more of the second set. (It's easy to see the trailing `*`, but miss that it's two separate sets.) – Mike Cargal Jan 14 '23 at 17:28

Try this:

# Install the ANTLR Python 3 runtime
!pip install antlr4-python3-runtime

# Generate a parser from the grammar
!antlr4 -Dlanguage=Python3 SRXBackend.g4

# Import the ANTLR runtime classes and the generated parser and lexer
from antlr4 import InputStream, CommonTokenStream
from SRXBackendParser import SRXBackendParser
from SRXBackendLexer import SRXBackendLexer

# Read the input file
with open("input.txt", "r") as file:
    input_str = file.read()

# Create a lexer and parser
lexer = SRXBackendLexer(InputStream(input_str))
stream = CommonTokenStream(lexer)
parser = SRXBackendParser(stream)

# Parse the input
tree = parser.acl()
Byrek
  • Make sure you replace the input.txt with the grammar file – Byrek Jan 05 '23 at 18:01
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 05 '23 at 20:29