ANTLR4 breaking rules down for logical generic lines

Question

This is a follow-up to this question, which Bart answered perfectly.

My goal is to get individual lines for either "generic script lines" or "lines inside a function body", ideally discarding whitespace, while still getting everything outside of the <% and %> tags in bulk. I came up with a solution, but the resulting tree looks messy.

Here is my lexer:

lexer grammar CmScriptLexer;

//Whitespace:  Spaces -> channel(HIDDEN);
ScriptStart : '<%' (Spaces)* -> mode(Script);
SpacesPlain : [\r\n]+ -> skip;
GenericText : . ;

mode Script;

 ScriptEnd  : '%>' -> mode(DEFAULT_MODE);
 Comment    : '\'' ~[\r\n]* -> skip;
 Function   : 'function' -> mode(FunctionDeclaration);
 NL : [\r\n]+;
 ScriptText : . ;

mode FunctionDeclaration;
 FunctionComment    : '\'' ~[\r\n]* -> skip;
 FunctionName      : Id;
 DeclarationSpaces : Spaces+ -> skip;
 OPar              : '(' -> mode(FunctionParameter);

mode FunctionParameter;
 FunctionParameterComment    : '\'' ~[\r\n]* -> skip;
 ParameterName   : Id;
 ParameterSpaces : Spaces+ -> skip;
 Comma           : ',';
 CPar            : ')' -> mode(InFunction);

mode InFunction;
 FunctionBodyComment    : '\'' ~[\r\n]* -> skip;
 EndFunction    : 'end' Spaces 'function' -> mode(Script);
 FunctionLine : ~[ \r\n]+;
 FunctionSpaces : Spaces+;
 //FunctionText   : . ;

fragment Spaces : [ \r\n\t]+;
fragment Id     : [a-zA-Z0-9_\u0080-\ufffe]+;

and my parser:

parser grammar CmScriptParser;

options { tokenVocab=CmScriptLexer; }

file
 : block* EOF
 ;

block
 : plainText
 | ScriptStart script* ScriptEnd
 ;

plainText
 : GenericText+ NL*
 ;

script
 : simpleScript NL*
 | function NL*
 ;

simpleScript
 : ScriptText+ 
 ;

function
 : Function FunctionName OPar parameters? CPar functionBody EndFunction
 ;

functionBody
 : functionLines+
 ;

functionLines
 : FunctionSpaces* functionLine FunctionSpaces*
 ;

functionLine
 : FunctionLine+
 ;

parameters
 : ParameterName ( Comma ParameterName )*
 ;

and finally what I'm using as a test case:

foo

bar
<%
line 1


line 2 
 
function x(y)
  spanning
  multiple
  lines
end function

function a(b)    no newlines         end function


  %>     
baz

My issue is that it seems really verbose, and I fear that my "solution", while it works with the test case, is poorly laid out and that I may be overthinking the rules.

Any suggestions on how to improve it? All I want is trimmed "line" elements, so ideally matching something like \n \n\n\tscript line \n\n\t\n would result in a line containing just script line.

EDIT: adding what I think is an example of what I am after, again, maybe not expressing the best way possible:

simpleScript:
  scriptLine: line1
  scriptLine: line2
function: 
  name: x
  parameters:
     parameter: y
  body:
    functionLine: spanning
    functionLine: multiple
    functionLine: lines
function: 
  name: a
  parameters:
     parameter: b
  body:
    functionLine: no newlines

The end goal is that, when walking the tree, I can create a new "function call object" and make calls like

script = new Script() // on script "enter"
script.addLine("line 1")
script.addLine("line 2")
program.addNode(script) // on script "exit"
...
function = new Function() // on function "enter"
function.setName("x") // on "function"?
...
function.addParameter("y") // on "parameter"
...
function.addBodyLine("spanning") // on "line" ??
function.addBodyLine("multiple")
function.addBodyLine("lines")
...
program.addFunctionDeclaration(function) // on function "exit" once complete

Can you manually create an (ascii) image of the parse tree you're after? — Bart Kiers, Apr 03 '23 at 16:53
I may not know the best notation but I'll give it a shot, obviously how I express it may not be what is logically best way to. — Nicholas, Apr 03 '23 at 17:48
Added what I could, including what may be a better representation of my goal. I am still learning ANTLR, so my explanations may not be the best; I apologize. I also understand that the proper way is to convert everything to a statement, but unfortunately I'm currently bound to having essentially "addLine" methods I can call. — Nicholas, Apr 03 '23 at 18:02

score 1 · Accepted Answer · answered Apr 03 '23 at 18:53

The problem is that inside a script, you cannot simply tell the lexer to match some non-space followed by everything except line breaks. Sure, that would match line 1, but it would also match function x(y) as a single token, because the lexer matches greedily (it tries to consume as many characters as possible). You must therefore chop up the tokens on whitespace.

You could merge some single char tokens using ~[ \t\r\n]+, but you cannot create tokens that cause multiple words with spaces in between to be matched as single tokens.

Something like this:

lexer grammar CmScriptLexer;

ScriptStart : '<%' Spaces* -> mode(Script);
GenericText : ~[ \t\r\n]+;
TextSpaces  : Spaces -> skip;

mode Script;
 ScriptEnd   : '%>' -> mode(DEFAULT_MODE);
 Comment     : '\'' ~[\r\n]* -> skip;
 Function    : 'function' -> mode(FunctionDeclaration);
 NL          : [\r\n]+;
 ScriptText  : ~[ \t\r\n]+;
 ScriptSpaces : Spaces -> skip;

mode FunctionDeclaration;
 FunctionComment   : '\'' ~[\r\n]* -> skip;
 FunctionName      : Id;
 DeclarationSpaces : Spaces+ -> skip;
 OPar              : '(' -> mode(FunctionParameter);

mode FunctionParameter;
 FunctionParameterComment : '\'' ~[\r\n]* -> skip;
 ParameterName            : Id;
 ParameterSpaces          : Spaces+ -> skip;
 Comma                    : ',';
 CPar                     : ')' -> mode(InFunction);

mode InFunction;
 FunctionBodyComment : '\'' ~[\r\n]* -> skip;
 EndFunction         : 'end' Spaces 'function' -> mode(Script);
 FunctionLine        : ~[ \t\r\n]+;
 FunctionSpaces      : Spaces+ -> skip;

fragment Spaces : [ \r\n\t]+;
fragment Id     : [a-zA-Z0-9_\u0080-\ufffe]+;
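
The greedy matching described above can be illustrated outside ANTLR with a small Python sketch. It is only a simulation under stated assumptions: the rule names `ScriptLine`, `LINE_RULES` and `WORD_RULES` and the helper `longest_match` are hypothetical and not part of either grammar, and Python regular expressions stand in for lexer rules. The one ANTLR behaviour it models is that the rule matching the most characters wins, and that a tie goes to the rule defined first.

```python
import re

# Hypothetical rule set: a "whole line" token competing with the keyword.
LINE_RULES = [
    ("Function", re.compile(r"function")),
    ("ScriptLine", re.compile(r"[^ \t\r\n][^\r\n]*")),
]

# Rule set shaped like the Script mode above: tokens are chopped on whitespace.
WORD_RULES = [
    ("Function", re.compile(r"function")),
    ("ScriptText", re.compile(r"[^ \t\r\n]+")),
]


def longest_match(rules, text, pos=0):
    """Return (rule_name, lexeme) for the rule matching the most characters.

    Ties go to the rule listed first, because a later rule only replaces
    the current best when its match is strictly longer.
    """
    best = None
    for name, pattern in rules:
        m = pattern.match(text, pos)
        if m and m.end() > pos:
            if best is None or len(m.group()) > len(best[1]):
                best = (name, m.group())
    return best
```

With `LINE_RULES`, the input `function x(y)` is swallowed whole by the line rule, so the keyword is never seen. With `WORD_RULES`, both rules match the eight characters of `function`, and the tie goes to `Function` because it is listed first.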

Hi Bart, thanks again for the help. The problem with this is I don't get individual line entries for stuff inside the function body, it's all clubbed together as one. Are there any suggestions for how to modify my parser accordingly? — Nicholas, Apr 03 '23 at 20:26
If you remove the parser rule `functionLine` and just do: `functionLines : FunctionSpaces* FunctionLine FunctionSpaces* ;`, it seems to result in separate lines. — Bart Kiers, Apr 03 '23 at 20:39
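
Since the lexer above produces one token per word, one possible way to recover trimmed "line" entries is to regroup the word tokens by their source line number after parsing. The sketch below is an assumption rather than something from the answer: the helper `group_into_lines` is hypothetical, and tokens are modelled as plain `(text, line)` tuples instead of real ANTLR token objects (in the Java runtime the line number would come from `Token.getLine()`).

```python
from itertools import groupby


def group_into_lines(tokens):
    """Join consecutive word tokens that share a line number.

    `tokens` is a list of (text, line) tuples in source order. Each
    resulting string is one trimmed line; runs of whitespace between
    words collapse to a single space because the lexer skipped them.
    """
    lines = []
    for _, group in groupby(tokens, key=lambda tok: tok[1]):
        lines.append(" ".join(text for text, _ in group))
    return lines
```

For the test case in the question, the body of function x would come back as three entries (spanning, multiple, lines), and the body of function a as the single entry no newlines, which could then be passed to something like addBodyLine one at a time.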
