JFlex - lex structured document

Question

The following image represents code that I need to lex.
The document has the following format (in terms of columns):

 1 -  5:     comment
 6 - 79:     actual code
80 - ..:     comment

If I only had to lex the middle part, there would be no issues at all.
Unfortunately, the initial and terminating comment are always present in the document.

Any ideas on how this could be implemented?
I was thinking about implementing a two-phase lexer, but my thoughts are still a bit confused.

You don't need a lexer for this, just a column-based splitter. — user207421, Aug 28 '21 at 10:07
@user207421 mmh, could you expand a bit? I mean, I need a lexer to lex the middle part of the source. — LppEdd, Aug 28 '21 at 10:09
@user207421 the beginning and ending comment must be included in the lexed tokens obviously. Just to be clear. — LppEdd, Aug 28 '21 at 10:12
It's not obvious to me. I've implemented a number of column-based languages, such as Cobol for a start, and we always threw the label and comment fields away. They aren't part of the grammar. What's different about this case? — user207421, Aug 28 '21 at 11:49
@user207421 sorry if I wasn't sufficiently clear, my bad. The lexer will be used to provide support for the RPG language inside IntelliJ IDEA. Thus I need to account for all tokens. — LppEdd, Aug 28 '21 at 12:18

GoWiser · Answer 1 · 2021-08-29T01:49:16.633

The question

If I only had to lex the middle part, there would be no issues at all. Unfortunately,
the initial and terminating comment are always present in the document. Any ideas on
how this could be implemented?

First solution

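I would add a rule that matches any single character as the last rule of the lexer, and I would modify the symbol() helper functions to return a space symbol when the token starts in columns 1 to 5 or beyond column 79. JFlex's yycolumn is 0-based, so those columns correspond to yycolumn 0 to 4 and yycolumn 79 or greater (assuming the type for space is 20):

%{
    // Token type returned for the comment columns (assumed to be 20).
    private static final int SPACE = 20;

    // yycolumn is 0-based and needs %column: columns 1-5 are 0..4, column 80 onwards is >= 79.
    private boolean inCommentColumns() {
        return yycolumn < 5 || yycolumn >= 79;
    }

    private Symbol symbol(int type) {
        return new Symbol(inCommentColumns() ? SPACE : type, yyline, yycolumn);
    }

    private Symbol symbol(int type, Object value) {
        return new Symbol(inCommentColumns() ? SPACE : type, yyline, yycolumn, value);
    }
%}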

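The solution preserves column information. If you need to preserve the comments, then create a comment-character token and return it instead of the space token. Note that the check only looks at the column where a token starts, so a token that begins in the code area and extends past column 79 is not split.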
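
The column test can be tried outside JFlex. The following is a minimal sketch, not part of any generated lexer: the class name ColumnZones, the Zone enum, and the two constants are hypothetical names introduced here for illustration, and the sketch assumes a 0-based column index like the one JFlex keeps in yycolumn.

```java
/** Hypothetical helper: classifies a 0-based column of a fixed-format line. */
final class ColumnZones {
    static final int CODE_START = 5;  // 0-based index of document column 6
    static final int CODE_END = 79;   // exclusive; 0-based index of document column 80

    enum Zone { LEADING_COMMENT, CODE, TRAILING_COMMENT }

    /** Maps a 0-based column (such as yycolumn) to the zone it falls in. */
    static Zone zoneOf(int zeroBasedColumn) {
        if (zeroBasedColumn < 0) {
            throw new IllegalArgumentException("column must not be negative");
        }
        if (zeroBasedColumn < CODE_START) {
            return Zone.LEADING_COMMENT;
        }
        if (zeroBasedColumn < CODE_END) {
            return Zone.CODE;
        }
        return Zone.TRAILING_COMMENT;
    }

    public static void main(String[] args) {
        // Print the zone for the columns on either side of each boundary.
        for (int col : new int[] {0, 4, 5, 78, 79}) {
            System.out.println(col + " -> " + zoneOf(col));
        }
    }
}
```

Keeping the leading and trailing zones distinct makes it easy to return different token types for the two comment fields, should the highlighter need to tell them apart.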

Second solution

Or I would add two rules to the lexer: one that matches the first comment in each line and returns a whitespace token of length 5:

^.....

And one that matches the second comment in each line and returns a whitespace token with the length of the comment:

^(?<=...............................................................................).*

I have never used the non-capturing 'only if preceded by' construct (lookbehind) with JFlex, so I don't know if it is supported. Sorry.

The solution preserves column information. Again, if you need to preserve the comments, then return a comment token, otherwise return a whitespace token.
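
Since it is unclear whether JFlex accepts the lookbehind form, the same split can be checked with java.util.regex, which avoids lookbehind by capturing the three fields in one pattern. This is only a sketch under that assumption: the class LineFields, its split method, and the "KIND start end" string format are hypothetical, and the input is assumed to be a single line without its line terminator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: splits one line into its three column fields. */
final class LineFields {
    // Group 1: columns 1-5, group 2: columns 6-79 (74 characters), group 3: column 80 onwards.
    private static final Pattern FIELDS = Pattern.compile("(.{0,5})(.{0,74})(.*)");
    private static final String[] KINDS = {"COMMENT", "CODE", "COMMENT"};

    /** Returns "KIND start end" entries (0-based, end exclusive) for the non-empty fields. */
    static List<String> split(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.matches()) {
            // '.' does not match line terminators, so this only fails for multi-line input.
            throw new IllegalArgumentException("line must not contain a line terminator");
        }
        List<String> fields = new ArrayList<>();
        for (int g = 1; g <= 3; g++) {
            if (m.end(g) > m.start(g)) {
                fields.add(KINDS[g - 1] + " " + m.start(g) + " " + m.end(g));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("ABCDEsome code in the middle part"));
    }
}
```

The start and end values are offsets within the line, so a caller that needs document-wide ranges would add the offset of the line start to each of them.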

Third solution

Or I would write two lexers: the first one replaces the first 5 characters in every line with whitespace (to preserve column information for the second lexer) and removes the characters after column 79.

The first lexer can be written in any language OR you can use the command line tool sed (or a similar tool) to do it. Here is an example using sed:

The input to sed named input.txt:

ABCDE67890123456789012345678901234567890123456789012345678901234567890123456789FGHJKL
ABCDEThis is the text we want, not the start and not the end of the line.      FGHJKL

The sed command:

sed  's/^.....\(..........................................................................\).*$/\1/' input.txt > output.txt

The output from sed named output.txt:

67890123456789012345678901234567890123456789012345678901234567890123456789
This is the text we want, not the start and not the end of the line.

You can modify the script to preserve column positions by inserting 5 spaces in the replacement part of the command, but it is not suited for returning the comments.
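
If the first pass is written in Java instead of sed, it could look like the sketch below. It is a variation rather than a translation of the sed command: it blanks the trailing comment with spaces instead of removing it, so every character keeps its original offset. The class name CommentBlanker and the method blank are hypothetical, and each char is assumed to occupy exactly one column (tabs are not expanded).

```java
/** Hypothetical pre-pass: blanks the comment columns while keeping all offsets. */
final class CommentBlanker {
    static final int CODE_START = 5;  // 0-based index of document column 6
    static final int CODE_END = 79;   // exclusive; 0-based index of document column 80

    /** Replaces every character outside columns 6-79 with a space; line terminators are kept. */
    static String blank(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int column = 0; // 0-based column within the current line
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n' || c == '\r') {
                out.append(c);
                column = 0; // a line terminator starts a new line
            } else {
                boolean inCode = column >= CODE_START && column < CODE_END;
                out.append(inCode ? c : ' ');
                column++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(blank("ABCDEfirst line\nFGHIJsecond line"));
    }
}
```

Because the output has the same length as the input, token offsets reported by the second lexer can be used directly against the original document.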

edited Aug 29 '21 at 01:49

answered Aug 28 '21 at 10:08

GoWiser


Using a two-phase lexer, one has to merge tokens afterwards, however. – LppEdd Aug 28 '21 at 10:21
Must the comments be passed from the lexer to the parser? Otherwise you can just let the first lexer (that you do not write with jflex, but in plain old java) replace them with whitespace (to preserve column indexes). – GoWiser Aug 28 '21 at 10:32
I don't need comments to be passed to the parser, but comments must produce specific tokens with index ranges (start-end), as I need them for highlighting purposes. – LppEdd Aug 28 '21 at 10:35
Then replace the first 5 columns with whitespace to preserve column positions for highlighting. I'll update the answer. – GoWiser Aug 28 '21 at 10:36
That means I'd have to pass the document two times, one to replace start and end comments, and one to lex the code, right? – LppEdd Aug 28 '21 at 10:43
Or am I misunderstanding you? The comments are not real comments? – GoWiser Aug 28 '21 at 10:43
The comments are real comments. Check out https://en.m.wikipedia.org/wiki/IBM_RPG – LppEdd Aug 28 '21 at 10:44
"That means I'd have to pass the document two times, one to replace start and end comments, and one to lex the code?" - Yes, unless you use the first single-lexer approach, where you modify the symbol() functions. – GoWiser Aug 28 '21 at 10:48
If you are not required to use jflex, I would just write the lexer in plain old java or whatever language you want. – GoWiser Aug 28 '21 at 10:52
I have already written a lexer for the fixed format version of that language, but writing one for the C-like version is an enormous task which I really don't want to tackle. – LppEdd Aug 28 '21 at 10:53
Thanks btw, it's difficult to understand how to lex that language when you're out of context. So don't worry about it. – LppEdd Aug 28 '21 at 10:56
You can also add two rules to a single lexer: one that matches the first 5 characters, using "^.....", and one that matches characters only if preceded by 79 characters, like "^(?<=...............................................................................).*". I have never used the non-capturing 'only if preceded by' construct, so I don't know if jflex supports it. NOTE: There should be no line break in the second regular expression; Stack Overflow formatting inserts a break on my end. – GoWiser Aug 28 '21 at 11:09
Does the solution answer your question? – GoWiser Aug 29 '21 at 01:41
