I'm attempting to write a lexer for FitNesse using JFlex and am having trouble with WikiWords (http://fitnesse.org/FitNesse.UserGuide.WikiWord).
I copied the regex from that page and am using the following rules for tokens:
. # Regular character
[A-Z]([a-z0-9]+[A-Z][a-z0-9]*)+ # WikiWord
I'm having trouble properly lexing ThisIsNotAWikiWord, though. It has two capitals in a row, so it should not be considered a WikiWord. So I need to add a lookahead to check that the next character is not a letter or digit. Something like [A-Z]([a-z0-9]+[A-Z][a-z0-9]*)+ / [^A-Za-z0-9].
This works fine for lexing ThisIsNotAWikiWord, but it breaks lexing WikiWords in general. When lexing WikiWord, there is no extra character for the lookahead, so it doesn't match.
I think I want an optional lookahead: if there is a character after the match, it must not be a letter or digit, but if there is no more input, the rule should still match.
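To make the behaviour I'm after concrete, here is a rough sketch using java.util.regex rather than JFlex. It is only an illustration: java.util.regex is a backtracking engine and its (?=...) and (?!...) operators are not JFlex syntax, so this shows the semantics I want, not a JFlex rule. The class name WikiWordLookaheadDemo and the helper lexAtStart are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiWordLookaheadDemo {
    // WikiWord pattern copied from the question.
    static final String WIKI_WORD = "[A-Z]([a-z0-9]+[A-Z][a-z0-9]*)+";

    // "Required" lookahead: a non-alphanumeric character must follow the match,
    // so it cannot succeed when the WikiWord is the last thing in the input.
    static final Pattern REQUIRED = Pattern.compile(WIKI_WORD + "(?=[^A-Za-z0-9])");

    // "Optional" lookahead: the match must not be followed by a letter or digit.
    // This also succeeds when there is no following character at all.
    static final Pattern OPTIONAL = Pattern.compile(WIKI_WORD + "(?![A-Za-z0-9])");

    // Try to match a token at the very start of the input, the way a lexer would.
    // Returns the matched text, or null if the pattern does not match there.
    static String lexAtStart(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"WikiWord", "WikiWord and more", "ThisIsNotAWikiWord"};
        for (String in : inputs) {
            System.out.println(in
                    + " | required: " + lexAtStart(REQUIRED, in)
                    + " | optional: " + lexAtStart(OPTIONAL, in));
        }
    }
}
```

With the required form, WikiWord on its own is rejected, which is the problem described above. With the optional form, WikiWord is accepted both at the end of the input and before a space, while ThisIsNotAWikiWord is rejected in both cases.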
The documentation leads me to believe this isn't possible, but I'm hoping it's just my lack of regex-fu. From the docs:
In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either a '$' (the end of line operator) or a '/' followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match (see also 4.3.3 How the input is matched).
In the '$' case r is only matched at the end of a line in the input. The end of a line is denoted by the regular expression \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085. So a$ is equivalent to a / \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085. This is a bit different to the situation described in [5]: since in JFlex $ is a true trailing context, the end of file does not count as end of line.