Optional JFlex lookahead with End of File

Question

I'm attempting to write a lexer for FitNesse using JFlex and am having trouble with WikiWords (http://fitnesse.org/FitNesse.UserGuide.WikiWord).

I copied the regex from the linked page and am using the following rules for tokens:

.                               # Regular character
[A-Z]([a-z0-9]+[A-Z][a-z0-9]*)+ # WikiWord

I'm having trouble properly lexing ThisIsNotAWikiWord, though. It has two capitals in a row, so it should be treated as regular text, not as a WikiWord. So I need to add a lookahead to check that the next character is not a letter or digit. Something like [A-Z]([a-z0-9]+[A-Z][a-z0-9]*)+ / [^A-Za-z0-9].

This works fine for lexing ThisIsNotAWikiWord, but it breaks lexing WikiWords in general. When the input ends right after WikiWord, there is no extra character for the lookahead to match, so the rule doesn't match.

I think I want an optional lookahead: if there is a character after the match, it must not be a letter or digit, but if there is no further input, the rule should still match.
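To make the problem concrete, here is a small sketch using java.util.regex rather than JFlex. It assumes that JFlex's `/` trailing context behaves roughly like a positive lookahead `(?=...)`, and it uses a negative lookahead `(?!...)` to express the "optional" variant. The `(?!...)` and `(?:...)` syntax belongs to java.util.regex and is not assumed to be accepted by JFlex; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class LookaheadAtEndOfInput {
    // The WikiWord pattern from the question, with a non-capturing group.
    static final String WIKI = "[A-Z](?:[a-z0-9]+[A-Z][a-z0-9]*)+";

    // Positive lookahead: one more non-alphanumeric character must exist.
    static final Pattern REQUIRED = Pattern.compile(WIKI + "(?=[^A-Za-z0-9])");

    // Negative lookahead: the next character must not be alphanumeric,
    // which is also satisfied when there is no next character at all.
    static final Pattern OPTIONAL = Pattern.compile(WIKI + "(?![A-Za-z0-9])");

    // True if the pattern matches a prefix of the input.
    static boolean matchesAtStart(Pattern p, String input) {
        return p.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(matchesAtStart(REQUIRED, "WikiWord"));
        System.out.println(matchesAtStart(REQUIRED, "WikiWord "));
        System.out.println(matchesAtStart(OPTIONAL, "WikiWord"));
        System.out.println(matchesAtStart(OPTIONAL, "ThisIsNotAWikiWord"));
    }
}
```

The positive form fails on a bare WikiWord at end of input, which mirrors the behaviour described above; the negative form is the "optional lookahead" being asked for, in an engine that supports it.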

The documentation leads me to believe this isn't possible, but I'm hoping it's just my lack of regex-fu. From the docs:

In a lexical rule, a regular expression r may be followed by a look-ahead expression. A look-ahead expression is either a '$' (the end of line operator) or a '/' followed by an arbitrary regular expression. In both cases the look-ahead is not consumed and not included in the matched text region, but it is considered while determining which rule has the longest match (see also 4.3.3 How the input is matched).

In the '$' case r is only matched at the end of a line in the input. The end of a line is denoted by the regular expression \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085. So a$ is equivalent to a / \r|\n|\r\n|\u2028|\u2029|\u000B|\u000C|\u0085. This is a bit different to the situation described in [5]: since in JFlex $ is a true trailing context, the end of file does not count as end of line.

d_inevitable · Answer 1 · 2012-08-01T07:38:42.190

Lookarounds don't seem to be needed here.

As far as I understand, you are looking for camel-cased words that start with an uppercase letter and can contain digits, where a digit counts as a lowercase letter and each camel hump must begin with exactly one uppercase letter. If that is right, this regexp should work for you:

\b((?:[A-Z][a-z\d]+){2,})\b

The (?: part makes the parentheses non-capturing.

[A-Z][a-z\d]+ makes sure that exactly one uppercase character is followed by at least one lowercase character or digit.

{2,} forces the pattern to repeat at least twice, so that at least one camel hump is produced.
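As a quick check of the pattern above, here is a sketch that runs it through java.util.regex, which supports `\b` and `(?:`. This says nothing about whether JFlex accepts the same syntax; the class name, method name, and sample sentence are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WikiWordRegexDemo {
    // The pattern from the answer, with backslashes escaped for a Java string.
    static final Pattern WIKI_WORD =
            Pattern.compile("\\b((?:[A-Z][a-z\\d]+){2,})\\b");

    // Collects every substring of the text that the pattern accepts.
    static List<String> findWikiWords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WIKI_WORD.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
                findWikiWords("WikiWord and ThisIsNotAWikiWord plus FitNesse Hello"));
    }
}
```

ThisIsNotAWikiWord is rejected because the word boundary after "ThisIsNot" is missing (the next character is "A"), and no later position inside the word starts at a word boundary.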

edited Aug 01 '12 at 07:38

answered Jul 30 '12 at 06:39


You should be using proper word boundaries (`\b`) or your regex will fail on two consecutive WikiWords separated by one space. – Tim Pietzcker Jul 30 '12 at 06:53
@TimPietzcker thanks. You are right about the single space, but the word boundaries are not exactly the same. They also match at punctuation, but then it's not clear what the required boundaries are. If punctuation is not allowed as a boundary, then lookarounds are needed after all... – d_inevitable Jul 30 '12 at 07:40
@d_inevitable Thanks for the quick response and sorry for the delay. Unfortunately, JFlex doesn't seem to support `(?:`. With just `\b([A-Z][a-z\d]+([A-Z][a-z\d]+)+)\b`, JFlex claims it could not match the input for `WikiWord` for some reason. :( – George Shakhnazaryan Aug 01 '12 at 01:10
@GeorgeShakhnazaryan the `(?:` was optional anyway. If JFlex doesn't support it and you need to capture portions of your match, then you just need to adjust the group offsets accordingly. The only way I can imagine it not matching `WikiWord` is if it's at the start or end of the string and `\b` behaves differently in JFlex than in all the regexp engines that I know of. Try replacing it with `(\b|^)` and `(\b|$)` respectively. Otherwise send me the full string that you are matching it against. – d_inevitable Aug 01 '12 at 04:00
@d_inevitable The original and modified regex you gave make sense. Unfortunately, the regex support in JFlex seems to be limited. When processing `(\b|^)([A-Z][a-z\d]+([A-Z][a-z\d]+)+)(\b|$)`, JFlex gives a syntax error. It only seems to accept `^` and `$` as the first and last characters. And `$` only matches the end of a line, not end of input. On the bright side, I think I found another solution to my problem. Before, I was lexing `.` as non-wikiword text. Instead, I can just lex `[A-Za-z0-9]` as non-wikiwords. JFlex will use the regex that matches the longest input. Thanks, George – George Shakhnazaryan Aug 03 '12 at 02:45
@GeorgeShakhnazaryan the new pattern is exactly the same but only simplified, because `xx+` is the same as `x{2,}`. I don't understand what you mean by matching non-wikiwords. And neither `.` nor `[A-Za-z0-9]` makes sense to me. But I've never used JFlex before. – d_inevitable Aug 04 '12 at 14:22
@d_inevitable I'm attempting to write a lexer for a FitNesse file. One type of element is a WikiWord. Another is non-WikiWord text. I need a regex to capture a WikiWord, and another regex to capture non-WikiWord text. I was using `.` to do the latter, but doing `[A-Za-z0-9]` also works and also helps solve my original problem. The key to the solution was that JFlex attempts to capture as many characters as possible with a regex. So if the WikiWord regex captures 10 characters, and the non-WikiWord text regex captures 11 characters, JFlex will take the 11 and interpret them as non-WikiWord text. – George Shakhnazaryan Aug 05 '12 at 17:22
@GeorgeShakhnazaryan both `.` and `[A-Za-z0-9]` will overlap with the wiki word. I think what you may need is this: `(\W|(\b[a-z\d].*?\b)|(\b.*?[A-Z]{2,}.*?\b)|(\b.\b))`. This basically says that anything that is not a word character, any word that contains two uppercase letters in succession, and any single-character word is not a wiki word. – d_inevitable Aug 06 '12 at 07:42
Actually make that `(\W|(\b[^A-Z].*?\b)|(\b.*?[A-Z]{2,}.*?\b)|(\b.\b)|(\b[A-Z][^A-Z]+\b))`. The last part says that `Hello` is not a wiki word (any word that starts with an uppercase letter but contains no further uppercase characters). It's also better to use `[^A-Z]` to make sure nothing is missed. – d_inevitable Aug 06 '12 at 07:50
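The longest-match idea George describes in the comments can be imitated outside JFlex. The sketch below is a plain-Java simulation, not JFlex itself: it assumes the text rule is `[A-Za-z0-9]+` (the `+` is an assumption, since the comment shows only the character class), that the WikiWord rule is listed first so it wins ties, and that `.` remains as a fallback. All class, method, and token names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // Rule order matters: on equal length the earlier rule wins.
    static final String[] NAMES = {"WIKIWORD", "TEXT", "OTHER"};
    static final Pattern[] RULES = {
        Pattern.compile("[A-Z]([a-z0-9]+[A-Z][a-z0-9]*)+"), // from the question
        Pattern.compile("[A-Za-z0-9]+"),                    // assumed text rule
        Pattern.compile(".", Pattern.DOTALL)                // fallback
    };

    // Longest prefix of input[pos..] that the rule accepts as a whole.
    // Shrinking the region and calling matches() avoids depending on the
    // backtracking order of java.util.regex.
    static int longestMatch(Pattern rule, String input, int pos) {
        Matcher m = rule.matcher(input);
        for (int end = input.length(); end > pos; end--) {
            m.region(pos, end);
            if (m.matches()) {
                return end - pos;
            }
        }
        return 0;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = -1;
            int bestLen = 0;
            for (int i = 0; i < RULES.length; i++) {
                int len = longestMatch(RULES[i], input, pos);
                if (len > bestLen) { // strict '>' keeps the earlier rule on ties
                    best = i;
                    bestLen = len;
                }
            }
            tokens.add(NAMES[best] + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("WikiWord"));
        System.out.println(tokenize("ThisIsNotAWikiWord"));
    }
}
```

For ThisIsNotAWikiWord the WikiWord rule can cover only the first 10 characters ("ThisIsNotA"), while the text rule covers all 18, so the whole word becomes text. For WikiWord both rules cover 8 characters and the earlier rule wins, with no lookahead involved.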
