
I am trying to write a parser for Juniper SRX router access control lists. Below is the grammar I am using:

grammar SRXBackend;

acl:
    'security' '{' 'policies' '{' COMMENT* replaceStmt '{' policy* '}' '}' '}'
            applications
            addressBook
;

replaceStmt:
    'replace:' IDENT
|   'replace:' 'from-zone' IDENT 'to-zone' IDENT
;

policy:
    'policy' IDENT '{' 'match' '{' fromStmt* '}' 'then' (action | '{' action+ '}') '}'
;

fromStmt:
     'source-address' addrBlock                     # sourceAddrStmt
|    'destination-address' addrBlock                # destinationAddrStmt
|    'application' (srxName ';' | '[' srxName+ ']')  # applicationBlock
;

action:
    'permit' ';'
|   'deny' ';'
|   'log { session-close; }'
;

addrBlock:
    '[' srxName+ ']'
|   srxName ';'
;

applications:
    'applications' '{' application* '}'
|   'applications' '{' 'apply-groups' IDENT ';' '}' 'groups' '{' replaceStmt  '{' 'applications' '{' application* '}' '}' '}'
;

addressBook:
    'security' '{' 'address-book' '{' replaceStmt '{' addrEntry* '}' '}' '}'
|   'groups' '{' replaceStmt  '{' 'security' '{' 'address-book' '{' IDENT '{' addrEntry* '}' '}' '}' '}' '}' 'security' '{' 'apply-groups' IDENT ';' '}'
;

application:
    'replace:'? 'application' srxName '{' applicationStmt+ '}'
;

applicationStmt:
    'protocol' srxName ';'            #applicationProtocol
|   'source-port' portRange ';'       #applicationSrcPort
|   'destination-port' portRange ';'  #applicationDstPort
;

portRange:
    NUMBER             #portRangeOne
|   NUMBER '-' NUMBER  #portRangeMinMax
;

addrEntry:
    'address-set' IDENT '{' addrEntryStmt+ '}' #addrEntrySet
|   'address' srxName cidr ';'                 #addrEntrySingle
;

addrEntryStmt:
    ('address-set' | 'address') srxName ';'
;

cidr:
    NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ('/' NUMBER)?
;

srxName:
    NUMBER
|   IDENT
|   cidr
;

COMMENT : '/*' .*? '*/' ;
NUMBER  : [0-9]+ ;
IDENT   : [a-zA-Z][a-zA-Z0-9,\-_:\./]* ;
WS      : [ \t\n]+ -> skip ;

When I try to use an ACL with ~80,000 lines, it takes up to ~10 minutes to generate the parse tree. I am using the following code for creating the parse tree:

from antlr4 import *
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser
import sys


def main(argv):
    ipt = FileStream(argv[1])
    lexer = SRXBackendLexer(ipt)
    stream = CommonTokenStream(lexer)
    parser = SRXBackendParser(stream)
    parser.acl()

if __name__ == '__main__':
    main(sys.argv)

I am using Python 2.7 as the target language. I also ran cProfile to identify which code takes the most time. Below are the first few records, sorted by time:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   608448   62.699    0.000  272.359    0.000 LexerATNSimulator.py:152(execATN)
  5007036   41.253    0.000   71.458    0.000 LexerATNSimulator.py:570(consume)
  5615722   32.048    0.000   70.416    0.000 DFAState.py:131(__eq__)
 11230968   24.709    0.000   24.709    0.000 InputStream.py:73(LA)
  5006814   21.881    0.000   31.058    0.000 LexerATNSimulator.py:486(captureSimState)
  5007274   20.497    0.000   29.349    0.000 ATNConfigSet.py:160(__eq__)
 10191162   18.313    0.000   18.313    0.000 {isinstance}
 10019610   16.588    0.000   16.588    0.000 {ord}
  5615484   13.331    0.000   13.331    0.000 LexerATNSimulator.py:221(getExistingTargetState)
  6832160   12.651    0.000   12.651    0.000 InputStream.py:52(index)
  5007036   10.593    0.000   10.593    0.000 InputStream.py:67(consume)
   449433    9.442    0.000  319.463    0.001 Lexer.py:125(nextToken)
        1    8.834    8.834   16.930   16.930 InputStream.py:47(_loadString)
   608448    8.220    0.000  285.163    0.000 LexerATNSimulator.py:108(match)
  1510237    6.841    0.000   10.895    0.000 CommonTokenStream.py:84(LT)
   449432    6.044    0.000  363.766    0.001 Parser.py:344(consume)
   449433    5.801    0.000    9.933    0.000 Token.py:105(__init__)

I cannot make much sense of it, except that InputStream.LA takes around half a minute. I guess this is because the entire input text gets buffered/loaded at once. Is there any alternative or lazier way of parsing or loading data for the Python target? Is there any improvement I can make to the grammar to make parsing faster?
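
For reference, a profile like the one above can be produced by wrapping the same main() in cProfile; the dump file name and row count below are only examples, not necessarily what was originally used:

import cProfile
import pstats

# Run the main() defined above under the profiler (assumes this lives in
# the same script), then print the hottest functions sorted by total time.
cProfile.run('main(sys.argv)', 'parse.prof')
pstats.Stats('parse.prof').sort_stats('tottime').print_stats(25)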

Thank you

prthrokz
  • It is not an answer, but have you tried using PyPy or anything else? Just to know how much of a burden falls on Python? – Divisadero Mar 11 '16 at 10:01
  • I haven't used PyPy but did some more research since yesterday. It seems that ANTLR's input stream class converts the entire text input into a byte buffer character by character. That seems to take over a minute. Is there a faster way to do this? I am sure I can override the input stream's implementation as long as I can find a better way of doing this. – prthrokz Mar 11 '16 at 17:38
  • @prthrokz, I'd advise you to try the "old" Antlr 3. Antlr 4 tries to parse just about every grammar but has to put in excessive runtime effort to parse even very simple grammars. Antlr 3 is more restrictive, but fast. – Kijewski Mar 14 '16 at 18:21
  • @Kay: I read somewhere that we can fall back to ANTLR3-style SLL(*) parsing by setting a flag on the parser. Any idea if the Python target supports that feature? (See the sketch after these comments.) – prthrokz Mar 14 '16 at 18:25
  • @prthrokz, sorry, I have no idea. – Kijewski Mar 15 '16 at 16:36
  • @prthrokz Are you using the 4.5.3 runtime? If not, try switching to it and see if it makes any difference. I read somewhere that previous release(s) had a bug resulting in slow parsing, and this was fixed in 4.5.3, released on March 30th 2016 on PyPI - https://pypi.python.org/pypi/antlr4-python2-runtime – shikhanshu Apr 05 '16 at 19:09
  • I had a big performance problem with an Ada parser using ANTLR. Running it with PyPy divided running times by 10 or more. I recommend that approach if you cannot fix it by another means. – Jean-François Fabre Jun 17 '16 at 11:02
  • Could you also post the Lexer definition? – Apalala Aug 11 '16 at 18:51
  • Not a plug... although I'm the main author. If the target language is Python, you could consider using [Grako](https://pypi.python.org/pypi/grako/). The example in *examples/antlr2grako* will translate the ANTLR grammar to Grako (e-ebnf) format (better if you include the lexical definitions in the same .g file). – Apalala Aug 11 '16 at 22:51
  • @Apalala, the tokens are at the bottom of the grammar. – aghast Feb 02 '17 at 03:40
  • Please try this change: `portRange: NUMBER ('-' NUMBER)?;` – Apalala Feb 02 '17 at 11:06
  • It is my understanding that your `IDENT` can be zero-sized due to `*` instead of `+`. This would send your parser into all sorts of funny loops for every single character... – user2722968 Apr 30 '17 at 19:07
  • Is your question still relevant? If so, do you have a sample ACL file? – Laurent LAPORTE May 12 '17 at 19:40
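
A minimal sketch of the SLL(*) fallback asked about in the comments above, assuming the antlr4 Python runtime exposes the prediction mode on the parser's ATN simulator (parser._interp.predictionMode) and the BailErrorStrategy/ParseCancellationException pair the same way the Java runtime does; untested against the Python 2 runtime used in the question:

from antlr4 import FileStream, CommonTokenStream
from antlr4.atn.PredictionMode import PredictionMode
from antlr4.error.ErrorStrategy import BailErrorStrategy, DefaultErrorStrategy
from antlr4.error.Errors import ParseCancellationException
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser


def parse_acl(path):
    stream = CommonTokenStream(SRXBackendLexer(FileStream(path)))
    parser = SRXBackendParser(stream)
    # Fast path: SLL prediction, bail out on the first parse error.
    parser._interp.predictionMode = PredictionMode.SLL
    parser._errHandler = BailErrorStrategy()
    try:
        return parser.acl()
    except ParseCancellationException:
        # Slow path: rewind and reparse with full LL prediction.
        stream.seek(0)
        parser.reset()
        parser._interp.predictionMode = PredictionMode.LL
        parser._errHandler = DefaultErrorStrategy()
        return parser.acl()

If the input parses cleanly under SLL, the expensive full-LL prediction never runs; the second pass is only a safety net for genuinely ambiguous input.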

2 Answers


It is my understanding that your IDENT can be zero-sized due to * instead of +. This sends your parser into loops for every single character, generating zero-sized IDENT nodes.

user2722968
  • His rule is `IDENT : [a-zA-Z][a-zA-Z0-9,\-_:\./]*`. It can't be zero-sized. It has to have a letter (either case) followed by 0 or more of the second set. (It's easy to see the trailing `*`, but miss that it's two separate sets.) – Mike Cargal Jan 14 '23 at 17:28

Try this:

# Install the ANTLR Python 3 runtime
!pip install antlr4-python3-runtime

# Generate a parser from the grammar
!antlr4 -Dlanguage=Python3 SRXBackend.g4

# Import the ANTLR runtime classes and the generated parser and lexer
from antlr4 import InputStream, CommonTokenStream
from SRXBackendParser import SRXBackendParser
from SRXBackendLexer import SRXBackendLexer

# Read the input file
with open("input.txt", "r") as file:
    input_str = file.read()

# Create a lexer and parser
lexer = SRXBackendLexer(InputStream(input_str))
stream = CommonTokenStream(lexer)
parser = SRXBackendParser(stream)

# Parse the input
tree = parser.acl()
Byrek
  • Make sure you replace the input.txt with the grammar file – Byrek Jan 05 '23 at 18:01
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 05 '23 at 20:29