I want to write a simple parser for a simple scripting language. The input has text blocks and script blocks; inside the script blocks I want to be able to define functions as well as execute generic statements of any kind.
I don't really need to know or care what counts as a "statement", but I do need to parse function declarations. So even if something looks like a while loop and I have no rule for a while loop, can I match a "generic statement" rule and just get its content somehow?
Using a catch-all rule I can handle the "generic text" part fine, but in script mode I'm less successful. I tried nested modes where I set an 'IN FUNCTION' mode, but kept running into roadblocks.
For example, when inside a statement within my functionDeclaration, how can I match everything until the 'end function'? Furthermore, how can I just match a "generic" statement, so that I never need statement types like emptyStatement or assignmentStatement? Even if it just becomes a big "script code blob", that's fine with me.
Where I am so far:
My Grammar:
parser grammar ExprParser;
options { tokenVocab=ExprLexer; }
file
: block* EOF
;
block
: textBlock+
| script
;
textBlock
: HtmlDtd
| GenericText
| ScriptEnd
;
script
: topStatement+
| statement
;
topStatement
: functionDeclaration
;
functionDeclaration
: FunctionStart Ident L_PAREN R_PAREN statement* FunctionEnd
;
statement
: assignmentStatement
| emptyStatement
;
assignmentStatement
: Ident ASSIGNTO Ident SEMICOLON
;
emptyStatement
: SEMICOLON
;
My Lexer:
lexer grammar ExprLexer;
channels { Comments, SkipChannel }
SeaWhitespace: [ \t\r\n\f]+ -> channel(HIDDEN);
HtmlDtd: '<!' .*? '>';
ScriptStart: SCRIPT_START_FRAGMENT -> channel(SkipChannel), pushMode(SCRIPT);
// Catch all text
GenericText : . ;
mode SCRIPT;
ScriptEnd :'%' '>' -> channel(SkipChannel), popMode;
ScriptWhitespace : [ \t\r\n\f]+ -> channel(SkipChannel);
// Comments begin with single quote
ScriptSingleLineComment: '\'' -> channel(SkipChannel), pushMode(SingleLineCommentMode);
FunctionStart : FUNCTION_START_FRAGMENT;
FunctionEnd : FUNCTION_END_FRAGMENT;
Ident : ID;
COMMA : ',';
SEMICOLON : ';';
L_PAREN : '(';
R_PAREN : ')';
ASSIGNTO : '=';
mode SingleLineCommentMode;
Comment: ~[\r\n]+ -> channel(Comments);
CommentEnd: [\r\n] -> channel(SkipChannel), popMode; // exit from comment.
// Fragments
fragment ID: [a-zA-Z0-9_\u0080-\ufffe]+;
fragment NameString: [a-zA-Z_\u0080-\ufffe][a-zA-Z0-9_\u0080-\ufffe]*;
fragment SCRIPT_START_FRAGMENT : '<%';
fragment SCRIPT_END_FRAGMENT : '%>';
fragment FUNCTION_START_FRAGMENT : 'function';
fragment FUNCTION_END_FRAGMENT : 'end function'; // Space is required here
Some test strings
<! tagsIknow >
<tagsIdontknowbutwant>
<%
function xxx() 'this is a comment
x = y;
a = 1;
;
;
end function
a = 1;
b = 2;
%>
randomtext
<%
'another script
x = 3; 'inline comment again
%>
The kind of script I want to work with
blah
<%
function xxx()
while (true) ' notice I have no rule for a while loop
get me everything in here verbatim except for comments ' this ideally is trimmed
endwhile
end function ' I want everything until the 'end function' keyword, basically
%>
more generic text
EDIT:
My goal is for input like this
text1
<%
arbitrary script lines1
arbitrary script lines2
function x(a,b)
arbitrary script body containing anything
end function
arbitrary script lines3 again
%>
plain text
<%
arbitrary script lines4 again
function y()
different function body
end function
%>
So I get this:
PLAIN_TEXT_BLOB (matching TEXT1)
SCRIPT_BLOB (matching script lines 1 & 2 together)
FUNCTION
name: x
params: [a, b]
body: SCRIPT_BLOB (containing the body)
SCRIPT_BLOB (matching line 3)
PLAIN_TEXT_BLOB (matching 'plain text')
SCRIPT_BLOB (matching line 4)
FUNCTION
name: y
params: []
body: SCRIPT_BLOB (containing the body)
EOF
So in theory there are just three "types": plain texts, script objects (multiple lines), and functions (which themselves contain some params and a single script object).
Given these objects, I can preserve the order in which I encountered them and handle each appropriately: pushing "PLAIN TEXT" out raw, running "non-function scripts" in order, and declaring functions in order.
The problem is that I cannot seem to capture things like the function name or the parameters while I also have a greedy rule. The ANTLR lexer prefers the rule that matches the longest input, so a greedy catch-all wins over the more specific rules. This means I cannot have a rule for parameters that checks they are valid identifiers while also having a '.+' rule to collect the function body.
A compromise would be to collect the function as a whole (everything between 'function' and 'end function') and do a second parse on that block to extract the function header (name + params), but I am trying to avoid that.
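For reference, the first pass of that compromise could look roughly like the lexer below. This is only a sketch under assumptions: the grammar name TwoPassLexer and the token names FunctionBlock, Text and ScriptText are made up for illustration, comments are not handled, and the keyword would also match inside longer words such as "functional".

```antlr
lexer grammar TwoPassLexer;   // hypothetical name, for illustration only

// Default mode: plain text outside <% ... %>
ScriptStart : '<%' -> skip, pushMode(SCRIPT);
Text        : ~[<]+ | '<' ;   // '<%' is longer than '<', so ScriptStart wins there

mode SCRIPT;
ScriptEnd     : '%>' -> skip, popMode;
// Non-greedy match: the whole function, header included, becomes ONE token
FunctionBlock : 'function' .*? 'end function' ;
// Anything else in script mode, one character per token
ScriptText    : . ;
```

The text of each FunctionBlock token would then have to be handed to a second, much smaller parser for the name and parameters, which is exactly the extra step mentioned above.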
Another idea would be to have an additional mode, "FUNCTION_BODY_MODE", which the lexer enters once it encounters an R_PAREN and pops out of (twice) once it finds 'end function'. This way, anything between R_PAREN and 'end function' is the function's body, and inside that higher-level mode I can have a greedy rule.
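Something like:
// In SCRIPT mode: 'function' switches to a mode that lexes the header
FunctionStart: FUNCTION_START_FRAGMENT -> channel(SkipChannel), pushMode(IN_FUNCTION);
mode IN_FUNCTION;
// Rules for the name and parameters (Ident, L_PAREN, COMMA) would also go here
FunctionBodyStart: R_PAREN_FRAGMENT -> channel(SkipChannel), pushMode(IN_FUNCTION_BODY);
mode IN_FUNCTION_BODY;
FunctionBodyAndFunctionEnd : FUNCTION_END_FRAGMENT -> channel(SkipChannel), popMode, popMode; // double pop
ALL_TEXT : . ; // one token per character; the longer 'end function' match still wins
fragment R_PAREN_FRAGMENT : ')';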
My issue with the above is that it just sounds extremely counter-intuitive. I am very new to ANTLR parsers, so I'm just looking for the best advice on an approach that fits my purposes.
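To make the mode idea concrete, here is a fuller sketch of how it might be written. All names (SketchLexer, SketchParser, FUNCTION_HEADER, FUNCTION_BODY, BodyText and so on) are hypothetical, and the sketch assumes the whole header up to the closing parenthesis contains only identifiers, commas and whitespace. It uses the `mode(...)` command to switch from header to body, so a single `popMode` returns to script mode and no double pop is needed. As written, 'function' would also match inside a longer word, so a real grammar would need a stricter keyword rule.

```antlr
lexer grammar SketchLexer;   // hypothetical name

// Default mode: plain text outside <% ... %>
ScriptStart : '<%' -> skip, pushMode(SCRIPT);
Text        : ~[<]+ | '<' ;

mode SCRIPT;
ScriptEnd     : '%>' -> skip, popMode;
FunctionStart : 'function' -> pushMode(FUNCTION_HEADER);
ScriptComment : '\'' ~[\r\n]* -> skip;   // drop ' comments
ScriptText    : . ;                       // catch-all, one char per token

mode FUNCTION_HEADER;
HeaderWs : [ \t\r\n]+ -> skip;
Ident    : [a-zA-Z_] [a-zA-Z0-9_]* ;
L_PAREN  : '(' ;
COMMA    : ',' ;
R_PAREN  : ')' -> mode(FUNCTION_BODY);    // replace header mode, no push

mode FUNCTION_BODY;
FunctionEnd : 'end function' -> popMode;  // back to SCRIPT
BodyComment : '\'' ~[\r\n]* -> skip;
BodyText    : . ;                         // catch-all for the body
```

```antlr
parser grammar SketchParser;   // hypothetical name
options { tokenVocab=SketchLexer; }

file         : block* EOF ;
block        : textBlob | scriptBlob | functionDecl ;
textBlob     : Text+ ;
scriptBlob   : ScriptText+ ;
functionDecl : FunctionStart name=Ident L_PAREN params? R_PAREN body FunctionEnd ;
params       : Ident (COMMA Ident)* ;
body         : BodyText* ;
```

Because the catch-all rules only ever match one character, the longer keyword matches ('function', 'end function', '%>') take priority wherever they occur, and the text of a scriptBlob or body context gives the blob as a whole.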