12

I am making a Lexical Analyzer using Flex on Unix. If you've ever used it before you know that you mainly just define the regex for the tokens of whatever language you are writing the Lexical Analyzer for. I am stuck on the final part. I need the correct Regex for multi-line comments that allows something like

/* This is a comment \*/

but also allows

/* This **** //// is another type of comment */

Can anyone help with this?

Lesmana
LunaCodeGirl
  • Can you edit your question to improve the “problem” samples? They need newlines to properly express what you're having problems with, but I couldn't work out where they were missing. (Indenting by 4 spaces makes a paragraph into a sample code section.) – Donal Fellows Jan 21 '11 at 08:56
  • 1
    possible duplicate of [Why are multi-line comments in flex/bison so evasive?](http://stackoverflow.com/questions/4145498/why-are-multi-line-comments-in-flex-bison-so-evasive) – Bart Kiers Jan 21 '11 at 08:57

3 Answers

19

You don't match C-style comments with a simple regular expression in Flex; they require a more complex matching method based on start states. The Flex FAQ explains how (at least for the /*...*/ form; handling the other form in just the <INITIAL> state should be simple).
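To illustrate the start-state idea outside of Flex, here is a minimal sketch in plain C. It is not the FAQ's code: the names `strip_comments`, `INITIAL`, and `COMMENT` are made up for this example, and the two enum values only play the role that Flex's default state and an exclusive comment start state would play. It also assumes the input contains no string literals, which a real lexer would have to handle.

```c
#include <stddef.h>

/* Two scanner modes, standing in for Flex's default state and an
   exclusive "comment" start state (names are illustrative only). */
enum scan_state { INITIAL, COMMENT };

/* Copy src into dst with block comments removed.
   dst must have room for at least strlen(src) + 1 bytes.
   Returns 0 on success, -1 if a comment is still open at end of input. */
int strip_comments(const char *src, char *dst)
{
    enum scan_state st = INITIAL;
    size_t i = 0, o = 0;

    while (src[i] != '\0') {
        if (st == INITIAL) {
            if (src[i] == '/' && src[i + 1] == '*') {
                st = COMMENT;          /* comparable to BEGIN(comment) */
                i += 2;
            } else {
                dst[o++] = src[i++];   /* ordinary text: keep it */
            }
        } else {
            if (src[i] == '*' && src[i + 1] == '/') {
                st = INITIAL;          /* comparable to BEGIN(INITIAL) */
                i += 2;
            } else {
                i++;                   /* inside the comment: discard */
            }
        }
    }
    dst[o] = '\0';
    return st == INITIAL ? 0 : -1;
}
```

Because the scanner only looks for the two-character closer while in the comment state, runs such as `****` or `////` inside the comment need no special treatment, and nothing has to be buffered while the comment is skipped.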

rici
Donal Fellows
  • Ah, I figured there was a FAQ about it! :) +1 – Bart Kiers Jan 21 '11 at 09:00
  • @Bart: I found it the other day when answering a SO question (on parsing XML CDATA sections, a very similar problem in parsing terms except for the fact that it's *even more important* to do it the right way because the end-section sequence is three characters long). – Donal Fellows Jan 21 '11 at 09:03
  • If RegEx-only is necessary, "/*"( [^*] | (\*+[^*/]) )*\*+\/ would do the job. I've explained in greater detail in http://stackoverflow.com/a/32320759/3000919 – Abraham Philip Aug 31 '15 at 22:17
  • @DonalFellows please add the answer of that page to your answer. The answer could now get lost. – Tarick Welling Mar 11 '20 at 10:53
12

If you're required to make do with just regex, however, there is indeed a not-too-complex solution:

"/*"( [^*] | (\*+[^*/]) )*\*+\/

The full explanation and derivation of that regex are elaborated upon here.

In short:

  • "/*" marks the start of the comment
  • ( [^*] | (\*+[^*/]) )* accepts either any character that is not '*' (the [^*]) or a run of one or more '*' followed by a character that is neither '*' nor '/' (the (\*+[^*/])). Every '******...' run is therefore accepted except one that ends in '/', such as '*****/', because no run of '*' there is followed by something other than '*' or '/'.
  • The '*******/' case is then handled by the last part of the RegEx, \*+\/, which matches one or more '*' followed by a '/' to mark the end of the comment.
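As a quick way to check the expression outside of Flex, the sketch below feeds it to the POSIX `regcomp`/`regexec` functions. The helper name `is_block_comment` is made up for this example, and the pattern is an adaptation rather than the exact Flex text: the spaces are dropped, the Flex quoting `"/*"` becomes `/\*`, and `^`/`$` anchors are added so that the whole string must be a single comment. It assumes a POSIX system that provides `<regex.h>`.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text is, in its entirety, one C-style block comment,
   0 if it is not, and -1 if the pattern fails to compile. */
int is_block_comment(const char *text)
{
    /* The expression from the answer, rewritten as a POSIX extended
       regex: no spaces, and anchored at both ends. */
    static const char *pattern = "^/\\*([^*]|\\*+[^*/])*\\*+/$";
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);   /* 0 means a match was found */
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Both sample comments from the question are accepted by this check, while a string such as `/* a */ b */` is rejected because the first `*/` already closes the comment. In a Flex rule the same expression would presumably be written without the spaces, as `"/*"([^*]|\*+[^*/])*\*+\/`.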
Piotr Siupa
Abraham Philip
  • I don't think this regex will compile. Flex doesn't accept non-escaped white spaces in patterns. See: https://stackoverflow.com/a/52977608/3052438 – Piotr Siupa Jun 11 '23 at 11:45
  • In any case, the problem here is that comments can be arbitrarily long, and the rules of *lex(1)* and *flex(1)* would require it to accumulate the entire match before dispatching it, which is entirely undesirable. – user207421 Jun 11 '23 at 11:56
  • @user207421 It depends. If you just want to ignore comments, that can slow down the lexer somewhat without any benefit, but if you want to capture the comment, it is desirable. – Piotr Siupa Jun 12 '23 at 12:27
1

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html does:

"/*"            { comment(); }

comment() {
    char c, c1;

loop:
    while ((c = input()) != '*' && c != 0)
        putchar(c);

    if ((c1 = input()) != '/' && c != 0) {
        unput(c1);
        goto loop;
    }

    if (c != 0)
        putchar(c1);
}

A question which would also solve this is How do I write a non-greedy match in LEX / FLEX?

Community
Ciro Santilli OurBigBook.com