I would like to parse and store the following sequence: Ann {* some arbitrarily long sequence *}
My buffer length (YYLMAX) is set to 8K. Since the sequence can be arbitrarily long, I try to do it this way:
1. Search for the Ann token followed by {*. Once seen, go to the ANN state.
2. While in the ANN state, read one character at a time until YYLMAX-1 characters have been read. Return this token to YACC, which appends it to a dynamically growing C++ string buffer.
3. Repeat step 2 until *} is seen. At that point, set the state to INITIAL (0).
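The three steps above can be sketched outside of lex as a plain C++ loop. This is only a model of the intended behavior under stated assumptions: read_chunk and read_annotation are hypothetical names, a std::istream stands in for the lexer input, and max_chunk plays the role of YYLMAX. The one-character lookahead keeps a *} that straddles a chunk boundary from being missed.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads at most max_chunk - 1 characters into `chunk`.
// Returns true once the "*}" terminator has been consumed; the terminator
// itself is never stored in `chunk`.
bool read_chunk(std::istream& in, std::size_t max_chunk, std::string& chunk) {
    chunk.clear();
    char c;
    while (chunk.size() + 1 < max_chunk && in.get(c)) {
        if (c == '*' && in.peek() == '}') {
            in.get();          // consume the '}' of the terminator
            return true;
        }
        chunk += c;
    }
    return false;              // chunk is full or the input is exhausted
}

// Hypothetical driver: plays the role of the parser, appending every chunk
// to a growing string until the terminator is reached.
std::string read_annotation(std::istream& in, std::size_t max_chunk) {
    std::string text;
    std::string chunk;
    bool done = false;
    while (!done) {
        done = read_chunk(in, max_chunk, chunk);
        text += chunk;
        if (!done && chunk.empty()) {
            break;             // no progress: input ended without "*}"
        }
    }
    return text;
}
```

Calling read_annotation on a stream holding abc*def*}rest with max_chunk set to 4 collects the text in chunks of at most three characters and leaves rest unread in the stream.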
For small sequences (less than 8K), my code works. However, if the sequence exceeds 8K, the first grab works, but once that chunk has been matched by the annotation_text rule, the lexer seems to reset the state to INITIAL. As I continue reading the next characters in the sequence, I get a parse error because the lexer rule <ANN>[^*] no longer matches. I would like to avoid this and remain in the ANN state until the full sequence up to *} has been read. What is the correct way to implement this behavior? A sample would help.
I don't want to increase YYLMAX because I do not know in advance the maximum length of the sequence inside an Ann {* ... *} block.
Thank you for any ideas.
The relevant snippet of my YACC grammar looks like this:
string curr_annotation = "{*";

annotation : _STIL_ANN _STIL_ANN_LCURLY annotation_text _STIL_ANN_RCURLY
    {
        curr_annotation += "*}";
        std::cout << "Going to add annotation '" << curr_annotation << "'" << std::endl;
        p_block->addAnn(curr_annotation, stil_yylineno);
        curr_annotation = "{*";   // reset for the next annotation
    }
    ;

annotation_text : _STIL_ANN_TEXT
    {
        std::cout << "appending to annotation" << std::endl;
        curr_annotation += $1;
    }
    | annotation_text _STIL_ANN_TEXT
    {
        // Later chunks must be appended too; without an action, yacc's
        // default ($$ = $1) would silently drop them.
        curr_annotation += $2;
    }
    ;
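The accumulation these actions are meant to perform can be written as a small standalone C++ class. This is a sketch only: AnnotationBuilder, append_chunk, and finish are hypothetical names that do not appear in my parser. It shows that every _STIL_ANN_TEXT chunk, not just the first, has to reach the append step.

```cpp
#include <string>

// Hypothetical stand-in for the grammar actions; not part of the real parser.
class AnnotationBuilder {
public:
    // Mirrors `curr_annotation += ...;` and must run for every text chunk.
    void append_chunk(const std::string& chunk) { current_ += chunk; }

    // Mirrors the `annotation` action: close the text, hand it over,
    // and reset the buffer for the next annotation.
    std::string finish() {
        std::string done = current_ + "*}";
        current_ = "{*";
        return done;
    }

private:
    std::string current_ = "{*";   // same initial value as curr_annotation
};
```

Appending the chunks abc and def and then calling finish yields {*abcdef*}, after which the builder is back to its initial {* state.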
The relevant snippet of my LEXER looks like this:
WHITESPACE [ \t\n\r]
START_BLK_ANN {WHITESPACE}*"{*"
END_BLK_ANN "*}"
...
<ANN>{START_BLK_ANN} {
    printf("match1\n");
    TRACE(("<ANN>START_BLK_ANN <%s> \n", yytext));
    return(token(_STIL_ANN_LCURLY));
}
<ANN>{END_BLK_ANN} {
    printf("match2\n");
    TRACE(("<ANN>END_BLK_ANN <%s> \n", yytext));
    BEGIN 0;
    return(token(_STIL_ANN_RCURLY));
}
<ANN>[^*] {
    printf("match3\n");
    yylval.string = stiltok_grab_annotation();
    TRACE(("ANN_TEXT <%s> \n", yylval.string));
    return(token(_STIL_ANN_TEXT));
}
{IDENTIFIER} {
    printf("got here\n");
    temp_sym_name = yytext;
    tokVal = stiltoktbl.locateToken(temp_sym_name);
    yylval.string = yytext;
    if (tokVal != UNDEFINED_TOKEN) {
        if (tokVal == _STIL_USER_KEYWORD) {
            BEGIN USERKW;
            TRACE(("USER_KEYWORD <%s> \n", yytext));
        }
        else if (tokVal == _STIL_Include) {
            TRACE(("INCLUDE \n"));
            BEGIN INCLUDE;
            return(token(_STIL_INCLUDE));
        }
        else if (tokVal == _STIL_Ann) {
            TRACE(("ANN \n"));
            BEGIN ANN;   /* Ann token seen: switch state */
            return(token(_STIL_ANN));
        }
        else {
            TRACE(("KEYWORD <%s> \n", yytext));
        }
        return(tokVal);
    }
    else {
        TRACE(("IDENTIFIER <%s> \n", yytext));
        return(token(_STIL_IDENTIFIER));
    }
}
The stiltok_grab_annotation function looks like this:
char * stiltok_grab_annotation() {
    TRACEID("stiltok_grab_annotation", SEV_4);
    // yytext already holds the single character matched by <ANN>[^*];
    // the text read below is appended right after it.
    char * p_ann_begin = yytext;
    char * p_ann_end = p_ann_begin + 1;
    char c1 = yyinput();
    char c2 = yyinput();
    *(p_ann_end++) = c1;
    *(p_ann_end++) = c2;
    // Slide a two-character window until it holds "*}" or the buffer is full.
    while ( c1 != '*' || c2 != '}') {
        c1 = c2;
        c2 = yyinput();
        *(p_ann_end++) = c2;
        if ((p_ann_end - yytext) == (MAX_TOKEN_LENGTH-1)) {
            break;
        }
    }
    *(p_ann_end) = '\0';   // terminate the chunk
    // Push "*}" back so the <ANN>{END_BLK_ANN} rule can match it,
    // and cut it off the returned text.
    if ( c1 == '*' && c2 == '}' ) {
        *(p_ann_end-2) = '\0';
        yyunput(c2, yytext);
        yyunput(c1, yytext);
    }
    return p_ann_begin;
}
Sample run (just the relevant pieces):
match1 <-- in ANN state, matched '{*'
Line:[58] {*
match3 <-- in ANN state, matched single character. run stiltok_grab_annotation()...
Line:[58] ...... <-- received 8K-1 characters.
appending to annotation <-- successfully copied by yacc to string buffer
got here <-- lexer has matched the IDENTIFIER rule instead of staying in the ANN state
Line:[58]D_REG55_0_18_0 <-- trouble begins
-S- Parsing error detected in file: x [stilcomyacc_perror]
-T- Syntax Error [Line: 58]
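The "got here" line in the run above is printed by the {IDENTIFIER} rule, which carries no start-condition prefix. In lex, such rules stay active inside any inclusive (%s) start condition, and the longest match wins, so a multi-character identifier beats the one-character <ANN>[^*] rule. The declaration of ANN is not shown here, so whether this is what actually happens is an assumption. The toy model below is not flex itself; pick_rule, the Rule struct, and the [A-Za-z0-9_]+ identifier pattern are all hypothetical. It shows how the chosen rule would change if ANN were treated as exclusive (%x).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of lex rule selection; not the flex implementation.
struct Rule {
    std::string name;
    bool scoped_to_ann;   // true for rules written as <ANN>...
    std::size_t (*match_len)(const std::string&);   // 0 means no match
};

// Models <ANN>[^*]: exactly one character that is not '*'.
std::size_t match_not_star(const std::string& s) {
    return (!s.empty() && s[0] != '*') ? 1 : 0;
}

// Models {IDENTIFIER}, assumed here to be [A-Za-z0-9_]+.
std::size_t match_identifier(const std::string& s) {
    std::size_t n = 0;
    while (n < s.size() &&
           (std::isalnum(static_cast<unsigned char>(s[n])) || s[n] == '_')) {
        ++n;
    }
    return n;
}

// in_ann: the scanner is currently in the ANN start condition.
// ann_exclusive: true models %x ANN, false models %s ANN.
std::string pick_rule(const std::string& input, bool in_ann, bool ann_exclusive) {
    const std::vector<Rule> rules = {
        {"ANN_TEXT", true, match_not_star},
        {"IDENTIFIER", false, match_identifier},
    };
    std::string best = "none";
    std::size_t best_len = 0;
    for (const Rule& r : rules) {
        // Unprefixed rules are active in INITIAL and in inclusive conditions.
        const bool active = r.scoped_to_ann ? in_ann : (!in_ann || !ann_exclusive);
        if (!active) {
            continue;
        }
        const std::size_t len = r.match_len(input);
        if (len > best_len) {   // longest match wins; ties keep the earlier rule
            best_len = len;
            best = r.name;
        }
    }
    return best;
}
```

In this model, pick_rule("D_REG55_0_18_0", true, false) selects IDENTIFIER, while the same call with ann_exclusive set to true selects ANN_TEXT.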