
Neither of the two lexer generators commonly referenced, cl-lex and lispbuilder-lexer, allows state variables in the "action blocks", which makes it impossible to recognize a C-style multi-line comment, for example.

Which lexer generator for Common Lisp can recognize a C-style multi-line comment as a token?

Correction: This lexer actually needs to recognize nested, balanced multiline comments (not exactly C-style), so I can't do without state variables.
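
To illustrate why nesting calls for state, here is a minimal sketch of a hand-written scanner that tracks comment depth. It is written in Python rather than Common Lisp and is not tied to cl-lex or lispbuilder-lexer; the function name `scan_nested_comment` and the `/* ... */` delimiters are assumptions made for the example.

```python
def scan_nested_comment(text, start=0):
    """Return the index just past a balanced /* ... */ comment that begins
    at `start`, or None if no comment starts there or it is never closed."""
    if not text.startswith("/*", start):
        return None
    depth = 0  # the state variable: current nesting level
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1  # an opener pushes one level
            i += 2
        elif text.startswith("*/", i):
            depth -= 1  # a closer pops one level
            i += 2
            if depth == 0:
                return i  # outermost comment is closed
        else:
            i += 1
    return None  # ran out of input while still inside a comment


src = "/* a /* b */ c */ tail"
end = scan_nested_comment(src)
print(repr(src[:end]))
```

The counter `depth` is exactly the kind of information an action block would have to carry between matches.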

ealfonso

1 Answer


You can recognize a C-style multiline comment with the following regular expression:

[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]

It should work with any library that uses POSIX-compatible extended regex syntax. Although it is a bit hard to read because * is used both as an operator and as a literal character, it uses no non-regular features. It does rely on inverted character classes ([^*], for example) matching the newline character, but afaik that is pretty much universal, even for regex engines in which a wildcard does not match newline.
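
As a quick check, the expression can be tried in Python's `re` module, which accepts this syntax. This is only a sketch: the helper name `match_comment` and the sample strings are made up for illustration, and nothing here is specific to any Common Lisp library.

```python
import re

# The regex from the answer, unchanged.
C_COMMENT = re.compile(r"[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]")


def match_comment(text):
    """Return the comment at the very start of `text`, or None."""
    m = C_COMMENT.match(text)
    return m.group(0) if m else None


samples = [
    "/* one line */ int x;",
    "/* spans\n   two lines */ int y;",  # [^*] also matches the newline
    "/* stars ** inside **/ rest",
    "/***/",
    "/* unterminated",
]
for s in samples:
    print(repr(match_comment(s)))
```

The match stops at the first `*/`, which is the C behaviour and also the reason the same expression cannot handle nested comments.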

rici
  • Actually, in my case I need to recognize balanced, nested multiline comments, so I do think I need a comment-level state variable – ealfonso Sep 10 '15 at 03:58
  • @erjoalgo: Yes, those wouldn't be C-style, and they are not recognizable with a regular expression. Sorry for the confusion. – rici Sep 10 '15 at 04:08
  • Thanks for pointing this out. I was aware that this is possible but in my experience this (allowing repetitions on groups as opposed to only simple character classes) seems to be rarely supported? I could be wrong. – ealfonso Sep 10 '15 at 04:11
  • @erjoalgo: It's certainly available in every standard regex library, and in all the lexical generators I'm familiar with. I'm not familiar with the ones you mentioned, but I checked their docs: the brief `lispbuilder-lexer` documentation includes, for example, `"[0-9]+([.][0-9]+([Ee][0-9]+)?)"` (which is an optional group, not a repeated group, but the principle is roughly the same), and the regex library that `cl-lex` uses claims to be mostly Perl-compatible, so it would be highly surprising if it didn't allow repeated groups. – rici Sep 10 '15 at 04:16
  • Yes, I'm aware it is available in most lexical generators. I think it is another shortcoming of those two CL lexers that they do not seem to natively allow building on named subtokens – ealfonso Sep 10 '15 at 04:28
  • @erjoalgo cl-lex *does* support all the regular expressions mentioned above – Qudit Sep 10 '15 at 07:42
  • Does it support naming a regular expression and building on it? It's probably not too difficult to get around this, but I don't think it is supported natively – ealfonso Sep 10 '15 at 14:10
  • @erjoalgo: Regex macros are a feature of lex and its various derivatives (flex, jflex, ocamllex, and probably many others), but they are *macros* and consequently recursive expansion is not going to work. They add nothing aside from readability to regex specifications (and there is a tendency to overuse them, removing even that advantage). Allowing recursive matching would turn the "regex" into a CFG. This feature is allowed in ANTLR4, at least, but it is uncommon because it cannot make use of the highly efficient regular expression matching algorithm. – rici Sep 10 '15 at 19:41
  • @erjoalgo: The macro nature of named subexpressions is shown clearly in the PLY implementation, which is based on simple string concatenation. [documentation](http://www.dabeaz.com/ply/ply.html#ply_nn14). Looks to me like you could easily build this into cl-lex since it allows regexes to be supplied as sexprs. – rici Sep 10 '15 at 19:59
  • @erjoalgo cl-lex allows named groups and this is mentioned in the doc strings in the code. – Qudit Sep 11 '15 at 03:13
  • @qudit: If you're talking about [this feature](http://weitz.de/cl-ppcre/#*allow-named-registers*), I think that's a different concept. At least, it's different from the concept I was talking about; I don't really know what erjoalgo is looking for. – rici Sep 11 '15 at 03:38
  • @rici Yes, that's what I'm talking about. cl-lex does not support the macros you're talking about. I mentioned it because erjoalgo seemed to be saying that it did not support named groups or even repetition. This is not true. – Qudit Sep 11 '15 at 06:20
  • It's a nice feature of cl-lex to be able to capture groups within the regexp, although I was referring to the macro nature of named subexpressions. But even this is ultimately just a convenience; I would be able to get around it easily. My problem is that I do need state. I was able to get around this by hacking something into lispbuilder-lexer, but I guess I was looking for a lexer closer to flex in CL – ealfonso Sep 11 '15 at 16:05
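
The "macro" view of named subexpressions discussed in the comments can be sketched as plain string concatenation performed before the regex is compiled. The fragment names below (`DIGITS`, `FRACTION`, `EXPONENT`, `NUMBER`) are hypothetical, and the sketch uses Python's `re` module rather than PLY or cl-lex; it rebuilds the number pattern quoted from the lispbuilder-lexer documentation.

```python
import re

# Hypothetical named fragments; each one is just regex source text.
DIGITS = r"[0-9]+"
FRACTION = r"[.]" + DIGITS
EXPONENT = r"[Ee]" + DIGITS

# "Expanding the macros" is concatenation; the result is an ordinary regex.
NUMBER = DIGITS + "(" + FRACTION + "(" + EXPONENT + ")?)"

print(NUMBER)
print(bool(re.fullmatch(NUMBER, "3.14E10")))
print(bool(re.fullmatch(NUMBER, "42")))  # the fraction group is not optional
```

Because expansion happens on text, a fragment cannot refer to itself: the concatenation would never terminate, which is why this convenience does not help with nested comments.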