0

I'm writing a JFlex lexer for Lua, and I'm having trouble designing a regular expression to match one particular part of the language specification:

Literal strings can also be defined using a long format enclosed by long brackets. We define an opening long bracket of level n as an opening square bracket followed by n equal signs followed by another opening square bracket. So, an opening long bracket of level 0 is written as [[, an opening long bracket of level 1 is written as [=[, and so on. A closing long bracket is defined similarly; for instance, a closing long bracket of level 4 is written as ]====]. A long string starts with an opening long bracket of any level and ends at the first closing long bracket of the same level. Literals in this bracketed form can run for several lines, do not interpret any escape sequences, and ignore long brackets of any other level. They can contain anything except a closing bracket of the proper level.

In a nutshell, I am trying to design a regular expression that will match an opening long bracket, the string contents in between, and the closing long bracket. A match should only occur when the opening long bracket and closing long bracket have the same number of equal signs, which can be zero or more.

Tyler Levine

3 Answers

7

Well, I'm afraid tokenizing with regular expressions isn't good enough for this task. Regular expressions just aren't powerful enough.

There's no way to compare the number of '=' characters using plain regular expressions in JFlex. Perl would have a hack for that (\1, as suggested in another answer), but we're talking about a JFlex lexer here, not Perl.

The solution is to go with \[=*\[ for the left bracket token, \]=*\] for the right bracket token, and then in the layer above (a parser) compare if they match in length.

Anyway, you can look at read_long_string() in llex.c in the Lua source code and see how they did it without using regular expressions at all.
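To make the "compare the lengths yourself" idea concrete, here is a minimal hand-written scanner in Java. It is only a sketch of the approach, not a port of read_long_string(): the class and method names (`LongStringScanner`, `scan`) are made up for illustration, and it works on a whole `String` rather than on a JFlex input buffer.

```java
// Hypothetical helper, not part of JFlex or of the Lua sources.
final class LongStringScanner {

    /**
     * If a Lua long string starts at index `start` of `src`, returns the index
     * just past its closing long bracket. Returns -1 if no opening long bracket
     * starts there or if the string is never closed.
     */
    static int scan(String src, int start) {
        int n = src.length();
        if (start >= n || src.charAt(start) != '[') {
            return -1;
        }

        // Count the '=' signs of the opening bracket: this is the level.
        int i = start + 1;
        int level = 0;
        while (i < n && src.charAt(i) == '=') {
            level++;
            i++;
        }
        if (i >= n || src.charAt(i) != '[') {
            return -1; // "[" followed by "="s but no second "["
        }
        i++; // skip the second '['

        // Scan the contents, looking for "]" + level * "=" + "]".
        while (i < n) {
            if (src.charAt(i) != ']') {
                i++;
                continue;
            }
            int j = i + 1;
            int count = 0;
            while (j < n && src.charAt(j) == '=') {
                count++;
                j++;
            }
            if (count == level && j < n && src.charAt(j) == ']') {
                return j + 1; // closing bracket of the same level
            }
            // Wrong level: treat it as content. Resume at j, because the
            // character there may be the ']' that starts the real closer.
            i = j;
        }
        return -1; // unterminated long string
    }

    public static void main(String[] args) {
        String src = "[==[a]]b]=]c]==] tail";
        System.out.println(src.substring(0, scan(src, 0)));
    }
}
```

In a JFlex spec, the same comparison would probably live in the action code of the rule that matches a closing bracket, with the level remembered from the opening bracket.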

Tadeusz A. Kadłubowski
  • You could also use lexical states: match the opening bracket as `\[=*\[`, store its length, go to a new state, match content and any closing `\]=*\]`. If the closing match has the right length, return the token; if it has the wrong length, add it to the content. – lsf37 Apr 16 '15 at 20:49
  • `\1` is not a Perl hack, it is a backreference, and you can find it in many regex implementations (e.g. `sed` and `grep` have them). The problem is that JFlex does not support backreferences. – ntd Aug 04 '21 at 06:34
4
\[(=*)\[.*?\]\1\]

The \1 is a backreference: it matches the same text that the first () captured.
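For reference, this is roughly how the pattern behaves in an engine that does support backreferences. The sketch below uses `java.util.regex` rather than JFlex, and the class and method names are invented for the example. One assumption worth flagging: `DOTALL` is turned on, because without it `.` stops at line terminators and a multi-line long string would not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; java.util.regex only, this does not work in JFlex.
final class LongStringRegex {

    // Group 1 captures the '=' run, \1 requires the same run in the closer,
    // group 2 is the (lazy) string contents.
    static final Pattern LONG_STRING =
            Pattern.compile("\\[(=*)\\[(.*?)\\]\\1\\]", Pattern.DOTALL);

    /** Returns the contents of the first long string in `src`, or null. */
    static String contents(String src) {
        Matcher m = LONG_STRING.matcher(src);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(contents("x = [==[a]]b\nc]==]"));
    }
}
```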

SpliFF
  • This is not a valid JFlex regexp, or at least it doesn't mean what the answer says in JFlex. The `\1` has no special meaning in JFlex and is just matched as the character `1`. – lsf37 Apr 16 '15 at 20:46
3
\[(=*)\[.*?\]\1\]
Pumpuli