Given a positional language like the old IBM RPG, we can have a line such as

CCCCCDIDENTIFIER     E S             10

Where characters

 1-5:  comment
   6:  specification type
7-21:  identifier name
...And so on

Since JFlex is based on regular expressions, we would write something like:

[a-zA-Z][a-zA-Z0-9]{0,14} {0,14}

for the identifier name token.
This regular expression, however, can match up to 29 characters (1 + 14 + 14), well beyond the 15 characters allowed for the identifier name, so the excess would have to be returned with yypushback calls.

Thus, is there a way to limit how many characters JFlex reads for a particular token?

LppEdd

1 Answer

Regular expression based lexical analysis is really not the right tool to parse fixed-field inputs. You can just split the input into fields at the known character positions, which is way easier and a lot faster. And it doesn't require fussing with regular expressions.
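Splitting at known positions could look something like the following minimal sketch. It uses only the three columns given in the question (1-5, 6, 7-21); the class name, the field names and the choice to keep everything from column 22 onward as a single `rest` field are assumptions made for illustration, not part of any RPG specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FixedFieldSplitter {
    // Splits one source line at the column positions listed in the question.
    // Field names are hypothetical; only columns 1-21 are described there.
    static Map<String, String> split(String line) {
        // Pad short lines to 21 characters so substring() cannot throw.
        String padded = String.format("%-21s", line);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("comment", padded.substring(0, 5));             // columns 1-5
        fields.put("specType", padded.substring(5, 6));            // column 6
        fields.put("identifier", padded.substring(6, 21).trim());  // columns 7-21
        fields.put("rest", padded.substring(21));                  // column 22 onward, unsplit
        return fields;
    }

    public static void main(String[] args) {
        // Build the sample line with explicit padding so the columns are exact.
        String line = "CCCCCD" + String.format("%-15s", "IDENTIFIER") + "E S";
        System.out.println(split(line));
    }
}
```

Because the identifier field is cut out by position and then trimmed, no regular expression (and no pushback) is involved at all.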

Anyway, [a-zA-Z][a-zA-Z0-9]{0,14}[ ]{0,14} wouldn't be the right expression even if it did properly handle the token length, since the token is really the word at the beginning, without space characters.

In the case of fixed-length fields which contain something more complicated than a single identifier, you might want to feed the field into a lexer, using a StringReader or some such.
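As a rough illustration of feeding a single field to a lexer through a StringReader, here is a sketch that uses the JDK's `java.io.StreamTokenizer` as a stand-in for a generated scanner. A JFlex-generated class also takes a `java.io.Reader` in its constructor, so the same wiring should carry over, but the class name `FieldLexerSketch`, the token labels and the sample field content are all assumptions.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class FieldLexerSketch {
    // Tokenizes the text of ONE already-extracted field.
    // StreamTokenizer stands in for a generated lexer here.
    static List<String> lexField(String field) {
        StreamTokenizer lexer = new StreamTokenizer(new StringReader(field));
        List<String> tokens = new ArrayList<>();
        try {
            while (lexer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (lexer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add("WORD(" + lexer.sval + ")");
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("NUMBER(" + lexer.nval + ")");
                        break;
                    default:
                        // Any other single character, e.g. an operator.
                        tokens.add("CHAR(" + (char) lexer.ttype + ")");
                }
            }
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lexField("COUNT + 10"));
    }
}
```

The lexer only ever sees the characters of one field, so it cannot run past the field boundary regardless of what its rules match.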


Although I'm sure it's not useful, here's a regular expression which matches 15 characters which start with a word and are completed with spaces:

[a-zA-Z][ ]{14} |
[a-zA-Z][a-zA-Z0-9][ ]{13} |
[a-zA-Z][a-zA-Z0-9]{2}[ ]{12} |
[a-zA-Z][a-zA-Z0-9]{3}[ ]{11} |
[a-zA-Z][a-zA-Z0-9]{4}[ ]{10} |
[a-zA-Z][a-zA-Z0-9]{5}[ ]{9} |
[a-zA-Z][a-zA-Z0-9]{6}[ ]{8} |
[a-zA-Z][a-zA-Z0-9]{7}[ ]{7} |
[a-zA-Z][a-zA-Z0-9]{8}[ ]{6} |
[a-zA-Z][a-zA-Z0-9]{9}[ ]{5} |
[a-zA-Z][a-zA-Z0-9]{10}[ ]{4} |
[a-zA-Z][a-zA-Z0-9]{11}[ ]{3} |
[a-zA-Z][a-zA-Z0-9]{12}[ ]{2} |
[a-zA-Z][a-zA-Z0-9]{13}[ ] |
[a-zA-Z][a-zA-Z0-9]{14}

(That might have to be put on one very long line.)
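If anyone wants to experiment with that alternation, it can be generated rather than typed out. The sketch below builds the same fifteen alternatives and checks them with `java.util.regex`; the class and method names are made up, and JFlex's own regular expression syntax differs from Java's in places, so this only demonstrates the shape of the pattern, not a ready-made JFlex rule.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

class PaddedIdentifierPattern {
    static final int WIDTH = 15; // width of the identifier field (columns 7-21)

    // Builds: one letter, then `extra` alphanumerics, then spaces up to WIDTH,
    // for every possible identifier length from 1 to WIDTH.
    static String build() {
        StringJoiner alternatives = new StringJoiner("|");
        for (int extra = 0; extra < WIDTH; extra++) {
            int spaces = WIDTH - 1 - extra;
            alternatives.add("[a-zA-Z][a-zA-Z0-9]{" + extra + "}[ ]{" + spaces + "}");
        }
        return alternatives.toString();
    }

    public static void main(String[] args) {
        String pattern = build();
        String exact = String.format("%-15s", "IDENTIFIER"); // padded to 15 characters
        System.out.println(Pattern.matches(pattern, exact));        // full 15-character field
        System.out.println(Pattern.matches(pattern, exact + " "));  // 16 characters
        System.out.println(Pattern.matches(pattern, "IDENTIFIER")); // unpadded, 10 characters
    }
}
```

Every alternative has a total length of exactly 15, which is what keeps a match from running into the next field.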

rici
The idea was to use JFlex to handle the two variants of RPG, positional and free-form. But like you said, it's not practical in the end. I had already started reading line by line, splitting, lexing the tokens and putting them on a queue, which is polled each time the lexer advances. I'm not an expert, so I was looking for a knowledgeable opinion. Thanks! – LppEdd May 17 '21 at 05:12