Match anything until end tag (generic text) in simple lexer/parser using ANTLR4

Question

I want to write a simple parser for a simple scripting language. It has text blocks and script blocks; inside the script blocks, I want to be able to define functions as well as execute generic statements of any kind.

I don't really need to know or care what classifies as a "statement", but I do need to parse function declarations. So even if something looks like a while loop and I don't have a rule for a while loop, can I match a "generic statement" rule and just get the content somehow?

Using a catch-all rule I can handle the "generic text" part fine, but in script mode I'm less successful. I tried nested modes, setting an 'IN FUNCTION' mode, but kept running into roadblocks.

For example, when inside a statement within my functionDeclaration, how can I match everything until the end function? Furthermore, how can I match a "generic" statement, so that I never need statement types like emptyStatement or assignmentStatement? Even if it just becomes a big "script code blob", that's fine with me.

Where I am so far:

My Grammar:

parser grammar ExprParser;
options { tokenVocab=ExprLexer; }

file
    : block* EOF
    ;
    
block
    : textBlock+
    | script
    ;
    
textBlock
    : HtmlDtd
    | GenericText
    | ScriptEnd
    ;
    
script
    : topStatement+
    | statement
    ;
    
topStatement
    : functionDeclaration
    ;

functionDeclaration
    : FunctionStart Ident L_PAREN R_PAREN statement* FunctionEnd
    ;


statement
    : assignmentStatement
    | emptyStatement
    ;
    
assignmentStatement
    : Ident ASSIGNTO Ident SEMICOLON
    ;
    
emptyStatement
    : SEMICOLON
    ;

My Lexer

lexer grammar ExprLexer;

channels { Comments, SkipChannel }



SeaWhitespace:  [ \t\r\n\f]+ -> channel(HIDDEN);
HtmlDtd:        '<!' .*? '>';
ScriptStart:       SCRIPT_START_FRAGMENT -> channel(SkipChannel), pushMode(SCRIPT);

// Catch all text
GenericText : . ; 

mode SCRIPT;
ScriptEnd :'%' '>' -> channel(SkipChannel), popMode;
ScriptWhitespace : [ \t\r\n\f]+ -> channel(SkipChannel);

// Comments begin with single quote
ScriptSingleLineComment:  '\'' -> channel(SkipChannel), pushMode(SingleLineCommentMode);
    
FunctionStart :  FUNCTION_START_FRAGMENT;
FunctionEnd : FUNCTION_END_FRAGMENT;
Ident : ID;

COMMA     : ',';
SEMICOLON : ';';
L_PAREN   : '(';
R_PAREN   : ')';
ASSIGNTO  : '=';

mode SingleLineCommentMode;
Comment:                 ~[\r\n?]+ -> channel(Comments);
CommentEnd:              [\r\n] -> channel(SkipChannel), popMode; // exit from comment.


// Fragments
fragment ID: [a-zA-Z0-9_\u0080-\ufffe]+;
fragment NameString: [a-zA-Z_\u0080-\ufffe][a-zA-Z0-9_\u0080-\ufffe]*;
fragment SCRIPT_START_FRAGMENT : '<%';
fragment SCRIPT_END_FRAGMENT : '%>';
fragment FUNCTION_START_FRAGMENT : 'function';
fragment FUNCTION_END_FRAGMENT : 'end function'; // Space is required here

Some test strings

<! tagsIknow >
<tagsIdontknowbutwant>
<%
function xxx() 'this is a comment 

  x = y;
  a = 1;
  ;
  ;

end function

a = 1;
b = 2;

%>
randomtext
<%

  'another script
  x = 3; 'inline comment again
%>

The kind of script I want to work with

blah
<%

function xxx() 
   while (true) ' notice I have no rule for a while loop
     get me everything in here verbatim except for comments ' this ideally is trimmed
   endwhile 
end function ' I want everything until the 'end function' keyword, basically

%>

more generic text

EDIT:

My goal is for input like this

text1
<%

arbitrary script lines1
arbitrary script lines2


function x(a,b) 
   arbitrary script body containing anything
end function

arbitrary script lines3 again
%>
plain text
<%
arbitrary script lines4 again

function y() 
    different function body
end function
%>

So I get this:

PLAIN_TEXT_BLOB (matching TEXT1)
SCRIPT_BLOB (matching script lines 1 & 2 together)
FUNCTION
  name: x
  params: [a, b]
  body: SCRIPT_BLOB (containing the body)
SCRIPT_BLOB (matching line 3)
PLAIN_TEXT_BLOB (matching 'plain text')
SCRIPT_BLOB (matching line 4)
FUNCTION
  name: y
  params: []
  body: SCRIPT_BLOB (containing the body)
EOF

So in theory there are just three "types": plain text, script objects (multiple lines), and functions (which themselves contain some params and a single script object).

Given those objects, I can preserve the order in which I encountered them and handle each appropriately: pushing "PLAIN TEXT" out raw, running non-function scripts in order, and declaring functions in order.

The problem is that I cannot seem to capture things like the function name or the parameters while I also have a greedy rule (ANTLR's lexer prefers the rule that matches the most input, so the greedy rule wins). So I cannot have a rule for parameters that checks they are identifiers while also having a '.+' rule to collect the function body.

A compromise would be to collect the function as a whole (everything between function and end function) and run a second parse on that block to extract the function header (name + params), but I'm trying to avoid that.

Another idea would be to add a mode, "FUNCTION_BODY_MODE", that the lexer enters once it encounters an R_PAREN and pops out of (twice) once it finds end function. That way, anything between R_PAREN and end function is the function's body, and inside that mode I can have a greedy rule.

Something like

FunctionStart:       FUNCTION_START_FRAGMENT-> channel(SkipChannel), pushMode(IN_FUNCTION);

mode IN_FUNCTION;
FunctionBodyStart:       R_PAREN_FRAGMENT -> channel(SkipChannel), pushMode(IN_FUNCTION_BODY);

mode IN_FUNCTION_BODY;
FunctionBodyAndFunctionEnd : FUNCTION_END_FRAGMENT -> channel(SkipChannel), popMode, popMode; // double pop
ALL_TEXT : . ; // will consume everything

My issue with the above is that it sounds extremely counter-intuitive. I am very new to ANTLR, so I'm just looking for the best advice on an approach that fits my purposes.

Bart Kiers · Accepted Answer · 2023-04-01T19:13:38.320

Instead of pushing modes, I'd just use mode(...) to switch to another mode. This means you need not pop modes, making it a bit easier to understand what's going on.
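To make the difference concrete, here is a toy Python model of the bookkeeping involved. It is only an illustration, not the ANTLR runtime: the ModeTracker class and its method names are made up for this sketch, and the mode names are taken from the question and from the grammar in this answer.

```python
# Toy model of lexer mode bookkeeping (hypothetical class, not ANTLR's API).
class ModeTracker:
    def __init__(self):
        self.current = "DEFAULT_MODE"
        self.stack = []  # only touched by push_mode / pop_mode

    def mode(self, name):
        # mode(X): simply replace the current mode; nothing to undo later.
        self.current = name

    def push_mode(self, name):
        # pushMode(X): remember the current mode, then switch.
        self.stack.append(self.current)
        self.current = name

    def pop_mode(self):
        # popMode: go back to whichever mode was remembered last.
        self.current = self.stack.pop()


# Stack-based approach from the question: every push needs a matching pop.
pushed = ModeTracker()
pushed.push_mode("SCRIPT")
pushed.push_mode("IN_FUNCTION")
pushed.push_mode("IN_FUNCTION_BODY")
pushed.pop_mode()
pushed.pop_mode()  # the "double pop" on 'end function'

# mode(...) approach: each rule names the mode that comes next.
switched = ModeTracker()
for next_mode in ["Script", "FunctionDeclaration", "FunctionParameter",
                  "InFunction", "Script"]:
    switched.mode(next_mode)
```

Both trackers end up back in the script mode, but with mode(...) there is no stack to keep balanced, so each rule can be read on its own.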

I'd go for something like this:

ExprLexer.g4

lexer grammar ExprLexer;

ScriptStart : '<%' -> mode(Script);
GenericText : . ;

fragment Spaces : [ \r\n\t]+;
fragment Id     : [a-zA-Z0-9_\u0080-\ufffe]+;

mode Script;

 ScriptEnd  : '%>' -> mode(DEFAULT_MODE);
 Comment    : '\'' ~[\r\n]* -> skip;
 Function   : 'function' -> mode(FunctionDeclaration);
 ScriptText : . ;

mode FunctionDeclaration;

 FunctionName      : Id;
 DeclarationSpaces : Spaces+ -> skip;
 OPar              : '(' -> mode(FunctionParameter);

mode FunctionParameter;

 ParameterName   : Id;
 ParameterSpaces : Spaces+ -> skip;
 Comma           : ',';
 CPar            : ')' -> mode(InFunction);

mode InFunction;

 EndFunction    : 'end' Spaces 'function' -> mode(Script);
 FunctionSpaces : Spaces+ -> skip;
 FunctionText   : . ;

ExprParser.g4

parser grammar ExprParser;

options { tokenVocab=ExprLexer; }

file
 : block* EOF
 ;

block
 : plainText
 | ScriptStart script* ScriptEnd
 ;

plainText
 : GenericText+
 ;

script
 : ScriptText+
 | function
 ;

function
 : Function FunctionName OPar parameters? CPar functionBody EndFunction
 ;

functionBody
 : FunctionText*
 ;

parameters
 : ParameterName ( Comma ParameterName )*
 ;

which will parse your input:

text1
<%
arbitrary script lines1
arbitrary script lines2

function x(a,b)
   arbitrary script body containing anything
end function

arbitrary script lines3 again
%>
plain text
<%
arbitrary script lines4 again

function y()
    different function body
end function
%>
MU

like this:

(file 
  (block 
    (plainText t e x t 1 \n)) 
  (block <% 
    (script \n a r b i t r a r y   s c r i p t   l i n e s 1 \n a r b i t r a r y   s c r i p t   l i n e s 2 \n \n) 
    (script 
      (function function x ( (parameters a , b) ) 
        (functionBody a r b i t r a r y s c r i p t b o d y c o n t a i n i n g a n y t h i n g) end function)) 
    (script \n \n a r b i t r a r y   s c r i p t   l i n e s 3   a g a i n \n) %>) 
  (block 
    (plainText \n p l a i n   t e x t \n)) 
  (block <% 
    (script \n a r b i t r a r y   s c r i p t   l i n e s 4   a g a i n \n \n) 
    (script 
      (function function y ( ) 
        (functionBody d i f f e r e n t f u n c t i o n b o d y) end function)) 
    (script \n) %>) 
  (block 
    (plainText \n M U)) <EOF>)
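
If it helps to see the mode switching outside of ANTLR, the following is a rough Python sketch that mimics the lexer above with regular expressions and groups the tokens into the three kinds of objects the question asks for. It is an approximation under stated assumptions: MODES, SKIPPED, tokenize and to_blobs are names invented for this sketch, rules are tried in order rather than by longest match (which gives the same result for these particular rules), and none of the parser's validation is performed.

```python
import re

SPACES = r"[ \r\n\t]+"
ID = r"[a-zA-Z0-9_\u0080-\ufffe]+"

# Per mode: (token name, pattern, mode to switch to or None), tried in order.
MODES = {
    "DEFAULT_MODE": [
        ("ScriptStart", r"<%", "Script"),
        ("GenericText", r"[\s\S]", None),
    ],
    "Script": [
        ("ScriptEnd", r"%>", "DEFAULT_MODE"),
        ("Comment", r"'[^\r\n]*", None),
        ("Function", r"function", "FunctionDeclaration"),
        ("ScriptText", r"[\s\S]", None),
    ],
    "FunctionDeclaration": [
        ("FunctionName", ID, None),
        ("DeclarationSpaces", SPACES, None),
        ("OPar", r"\(", "FunctionParameter"),
    ],
    "FunctionParameter": [
        ("ParameterName", ID, None),
        ("ParameterSpaces", SPACES, None),
        ("Comma", r",", None),
        ("CPar", r"\)", "InFunction"),
    ],
    "InFunction": [
        ("EndFunction", r"end" + SPACES + r"function", "Script"),
        ("FunctionSpaces", SPACES, None),
        ("FunctionText", r"[\s\S]", None),
    ],
}
# Tokens marked "-> skip" in the grammar never reach the token list.
SKIPPED = {"Comment", "DeclarationSpaces", "ParameterSpaces", "FunctionSpaces"}


def tokenize(source):
    mode, pos, tokens = "DEFAULT_MODE", 0, []
    while pos < len(source):
        for name, pattern, next_mode in MODES[mode]:
            match = re.compile(pattern).match(source, pos)
            if match:
                break
        else:
            raise ValueError(f"no rule matches at offset {pos} in mode {mode}")
        if name not in SKIPPED:
            tokens.append((name, match.group()))
        pos = match.end()
        mode = next_mode or mode
    return tokens


def to_blobs(tokens):
    blobs, current = [], None  # current: the blob still being extended
    for name, text in tokens:
        if name in ("GenericText", "ScriptText"):
            kind = "text" if name == "GenericText" else "script"
            if current is None or current["kind"] != kind:
                current = {"kind": kind, "content": ""}
                blobs.append(current)
            current["content"] += text
        elif name == "Function":
            current = {"kind": "function", "name": None, "params": [], "body": ""}
            blobs.append(current)
        elif name == "FunctionName":
            current["name"] = text
        elif name == "ParameterName":
            current["params"].append(text)
        elif name == "FunctionText":
            current["body"] += text
        elif name in ("ScriptStart", "ScriptEnd", "EndFunction"):
            current = None  # a boundary: the next text starts a new blob


    return blobs


source = "hi<% a\nfunction f(a, b) x = 1 end function 'c\nb %>bye"
blobs = to_blobs(tokenize(source))
```

The function body comes back without whitespace, just as in the tree above, because FunctionSpaces is skipped. With the real grammar, the same grouping would normally be done in a listener or visitor over the parse tree rather than over raw tokens.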

Thanks for the help, but my goal is ideally to never have a real "statement" at all, and statements don't necessarily have to be valid. Essentially, there are two cases: I want to match a function (with name, parameters) and the body's "script", and outside of a function or functions, everything else is just a script. — Nicholas, Apr 01 '23 at 15:32
I edited my original question to add some additional information and goals, thank you for your help! — Nicholas, Apr 01 '23 at 16:00
Thank you so much, this makes a lot more sense, I appreciate it! Going to give this a shot soon! — Nicholas, Apr 01 '23 at 19:10
I have been using your response to great success, but was actually now trying to modify the parser such that I can get "script lines" as well as "function body lines", basically broken up every time I encounter a newline. I was able to somewhat do it for "scripts", but the function body part is failing. — Nicholas, Apr 02 '23 at 22:15
