The question
If I only had to lex the middle part, there would be no issues at all. Unfortunately,
the initial and terminating comments are always present in the document. Any ideas on
how this could be implemented?
First solution
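I would add a rule to the lexer, as the last rule, that matches any single character, and I would modify the symbol functions to return a space symbol whenever the token starts in columns 1 to 5 or beyond column 79, like so (assuming the type for space is 20). Note that JFlex's yycolumn is 0-based, so columns 1 to 5 are yycolumn 0 to 4 and column 80 is yycolumn 79. The check only looks at the column where the token starts:
%{
  // yyline and yycolumn are only maintained when the %line and %column
  // directives are set. yycolumn counts from 0.
  private static final int SPACE = 20;

  // True if the current token starts in columns 1 to 5 or beyond column 79.
  private boolean inCommentColumns() {
    return yycolumn < 5 || yycolumn >= 79;
  }

  private Symbol symbol(int type) {
    if (inCommentColumns())
      type = SPACE;
    return new Symbol(type, yyline, yycolumn);
  }

  private Symbol symbol(int type, Object value) {
    if (inCommentColumns())
      type = SPACE;
    return new Symbol(type, yyline, yycolumn, value);
  }
%}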
This solution preserves column information. If you need to preserve the comments, create a comment-character token and return it instead of the space token.
Second solution
Or I would add two rules to the lexer. The first matches the first comment in each line and returns a whitespace token of length 5:
^.....
The second matches the second comment in each line and returns a whitespace token with the length of the comment:
(?<=^.{79}).*
The ^ has to sit inside the lookbehind; placed in front of it, the pattern could never match. I have never used the 'only if preceded by' (lookbehind) construct with JFlex, so I don't know if it is supported. As far as I know, JFlex only offers lookahead (the r1 / r2 trailing context), so this rule may have to be emulated another way. Sorry.
This solution also preserves column information. Again, if you need to preserve the comments, return a comment token; otherwise return a whitespace token.
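To try the two patterns outside of JFlex, here is a minimal sketch using java.util.regex, which does support lookbehind. It only demonstrates the regular expressions, not a JFlex rule; the class and method names (CommentColumns, blankComments) are made up for this example, and it assumes Java 11 or later for String.repeat.

```java
import java.util.regex.Pattern;

final class CommentColumns {
    // Columns 1 to 5 of every line (MULTILINE lets ^ match at each line start).
    private static final Pattern LEADING =
            Pattern.compile("^.{5}", Pattern.MULTILINE);

    // Everything after column 79: the lookbehind requires exactly 79
    // characters between the start of the line and the match position.
    private static final Pattern TRAILING =
            Pattern.compile("(?<=^.{79}).+", Pattern.MULTILINE);

    /** Replaces both comment areas with spaces of the same length. */
    static String blankComments(String text) {
        String withoutLeading = LEADING.matcher(text)
                .replaceAll(m -> " ".repeat(m.group().length()));
        return TRAILING.matcher(withoutLeading)
                .replaceAll(m -> " ".repeat(m.group().length()));
    }
}
```

Because every replaced character becomes a space, the remaining text keeps its original column positions.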
Third solution
Or I would write two lexers: the first replaces the first 5 characters of every line with white space (to preserve column information for the second lexer) and removes the characters after column 79.
The first lexer can be written in any language, or you can use the command line tool sed (or a similar tool) to do it. Here is an example using sed:
The input to sed named input.txt:
ABCDE67890123456789012345678901234567890123456789012345678901234567890123456789FGHJKL
ABCDEThis is the text we want, not the start and not the end of the line.      FGHJKL
The sed command:
sed 's/^.....\(..........................................................................\).*$/\1/' input.txt > output.txt
The output from sed named output.txt:
67890123456789012345678901234567890123456789012345678901234567890123456789
This is the text we want, not the start and not the end of the line.
You can modify the script to preserve column positions by inserting 5 spaces in the replacement part of the command, but it is not suited for returning the comments.
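If you prefer to write the first pass in Java, a minimal sketch could look like the following. It blanks columns 1 to 5 instead of deleting them, so column positions are preserved, and it also copes with lines shorter than 79 characters. The class name ColumnPreprocessor and the method clean are hypothetical, and Java 11 or later is assumed for String.repeat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

final class ColumnPreprocessor {
    static final int PREFIX = 5;       // columns 1..5 hold the leading comment
    static final int LAST_COLUMN = 79; // the text we want ends at column 79

    /** Blanks columns 1..5 and drops everything after column 79. */
    static String clean(String line) {
        // Keep at most the first 79 characters of the line.
        String kept = line.length() > LAST_COLUMN
                ? line.substring(0, LAST_COLUMN)
                : line;
        // Overwrite the leading comment with spaces so columns stay aligned.
        int blank = Math.min(PREFIX, kept.length());
        return " ".repeat(blank) + kept.substring(blank);
    }

    // Filters standard input to standard output, line by line.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(clean(line));
        }
    }
}
```

It can be used like the sed command, for example: java ColumnPreprocessor.java < input.txt > output.txt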