
I don't know how to handle the '(', ')', and '*' characters that can appear inside a comment. Comments can span multiple lines.

Andrey Moiseev
  • If this `(* comment *)` is a comment, then a regex cannot really handle `(* zz (* zz *)` or `(* zz *) zz *)` – Johan Sjöberg Mar 27 '11 at 17:28
  • It's not clear to me whether the question is simply about masking the parentheses and the asterisk, or about context recognition (strings in comments, comments in strings, nested comments and so on). – user unknown Mar 27 '11 at 18:06
  • @user unknown - I know how to escape parentheses and asterisks; this question is about how to handle comments that are nested or contain *, ( or ). – Andrey Moiseev Mar 29 '11 at 04:52

3 Answers


A simple pattern to handle that is:

\(\*(.*?)\*\)

Example: http://www.rubular.com/r/afqLCDssIx

You probably also want to set the single-line flag so that . also matches line breaks (the comments are multiline): (?s)\(\*(.*?)\*\)

Note that it doesn't handle cases like (* inside strings, or other weird combinations. Your best bet is to use a parser, for example ANTLR, which already has a ready-made Pascal grammar (direct link).
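As a rough sketch, the lazy pattern with the single-line flag could be applied in Java as shown below. The class and method names are made up for illustration, and the limitations around strings and nested comments still apply.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class LazyCommentStripper {
    // (?s) lets '.' match line breaks, so a comment may span several lines.
    // The backslashes are doubled because this is a Java string literal.
    private static final Pattern COMMENT =
        Pattern.compile("(?s)\\(\\*(.*?)\\*\\)");

    // Removes every non-nested (* ... *) comment from the given source text.
    static String strip(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String code = "x := 1; (* first\n   line *) y := 2;";
        System.out.println(strip(code));
    }
}
```

Only the comment itself is removed, so the whitespace on both sides of it stays in the output.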

Kobi
  • +1 for finding an actual grammar that includes this. Just for completeness: even with a parser generator, dealing with potentially nested comments can be tricky. Also, you need to have the grammar for the rest of the (non-comment) content. E.g. if I have no knowledge of the rest of the syntax, then just looking for an opening `(*` token may cause me to find something inside a string literal. – phooji Mar 27 '11 at 17:42
  • @phooji - Thanks! A parser should handle even nested constructs - it does so easily enough for the whole code structure. That said, I doubt `(* zz *) zz *)` is valid - very few languages actually allow nested comments. `/* 1 /* 2 */ 1 */` wouldn't pass in any C++ compiler `:)` – Kobi Mar 27 '11 at 17:48
  • I want to emphasize that you have to escape the backslashes of `\(\*(.*?)\*\)` in Java, which looks like this in Java code: `String r = "\\(\\*(.*?)\\*\\)";` – user unknown Mar 27 '11 at 18:04
  • @Kobi: Right. I guess I'm thinking of a more old school Lex/Yacc setup where, to parse comments, you put the lexer in a special 'comment' mode. – phooji Mar 27 '11 at 18:19
  • Actually, I took part in a programming competition and spent a lot of time on the first tasks, but there was another one: "Delete all {}, // and (* *) comments from the input Pascal code". I had only 4 minutes to do it, so I decided to use Java's regular expressions. I managed to remove the // and {} comments, but my third regexp didn't work well. So, as I understand now, that task should be done without regexps. Still, with your regexp my program could perhaps have passed most of the tests, so thanks for the help! – Andrey Moiseev Mar 29 '11 at 05:07
  • This regex does not work with nested comments. Not innermost comments. Not outermost comments. – ridgerunner Mar 29 '11 at 15:04
  • @ridgerunner - The question doesn't ask for nested comments. Nested comments are not common, and not valid in most languages. It seems in Pascal it depends on the version: http://stackoverflow.com/questions/3842443/comments-in-pascal – Kobi Mar 29 '11 at 15:17
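The "comment mode" idea mentioned in the comments above can be sketched without a parser generator, as a small hand-written scanner. This is only an illustration under simplifying assumptions: string literals use single quotes, comments do not nest, a { comment ends only at } and a (* comment ends only at *). The class and method names are hypothetical.

```java
// Hypothetical sketch: strips //, { } and (* *) comments from Pascal-like
// source while leaving the contents of string literals untouched.
class PascalCommentScanner {
    private enum Mode { CODE, STRING, LINE, BRACE, PAREN }

    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        Mode mode = Mode.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (mode) {
                case CODE:
                    if (c == '\'') { mode = Mode.STRING; out.append(c); i++; }
                    else if (c == '/' && next == '/') { mode = Mode.LINE; i += 2; }
                    else if (c == '{') { mode = Mode.BRACE; i++; }
                    else if (c == '(' && next == '*') { mode = Mode.PAREN; i += 2; }
                    else { out.append(c); i++; }
                    break;
                case STRING:
                    // Copy the literal verbatim; a quote ends it. A doubled
                    // quote ('') simply closes and reopens the literal.
                    if (c == '\'') mode = Mode.CODE;
                    out.append(c);
                    i++;
                    break;
                case LINE:
                    // A line comment ends at the line break, which is kept.
                    if (c == '\n') { mode = Mode.CODE; out.append(c); }
                    i++;
                    break;
                case BRACE:
                    if (c == '}') mode = Mode.CODE;
                    i++;
                    break;
                case PAREN:
                    if (c == '*' && next == ')') { mode = Mode.CODE; i += 2; }
                    else i++;
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String code = "s := '(* kept *)'; (* gone *) t := 1; // gone too\n{ gone } u := 2;";
        System.out.println(strip(code));
    }
}
```

Because the scanner tracks whether it is inside a string literal, a (* inside quotes is copied to the output instead of being treated as a comment opener, which a standalone regex cannot do.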

If you want to find the innermost nested comment, for example with /* */ delimiters:

/* 
/*
comment1
/*
comment2
*/
*/
*/

the regular expression will be:

\/\*[^/*]*(?:(?!\/\*|\*\/)[/*][^/*]*)*\*\/

This will find:

/*
comment2
*/
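
For reference, here is a minimal Java sketch that runs this expression against the example input. The class and method names are invented for illustration; the slashes are left unescaped because / has no special meaning in a Java regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex comes from the text above.
class InnermostCStyleComment {
    static final Pattern INNERMOST =
        Pattern.compile("/\\*[^/*]*(?:(?!/\\*|\\*/)[/*][^/*]*)*\\*/");

    // Returns the first innermost comment, or null if there is none.
    static String firstInnermost(String text) {
        Matcher m = INNERMOST.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "/* \n/*\ncomment1\n/*\ncomment2\n*/\n*/\n*/";
        System.out.println(firstInnermost(text));
    }
}
```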
LanMi

Regarding the handling of nested comments: although it is true that you cannot use a Java regex to match an outermost comment, you can craft one that matches an innermost comment (with some notable exceptions, see the caveats below). Note that the expression \(\*(.*?)\*\) will NOT work in this case, because it matches from the first (* to the first *) and therefore does not correctly match an innermost comment. The following is a tested Java program that uses a (heavily commented) regex matching only innermost comments and applies it iteratively to correctly strip nested comments:

public class TEST {
    public static void main(String[] args) {
        String subjectString = "out1 (* c1 *) out2 (* c2 (* c3 *) c2 *) out3";
        String regex = "" +
            "# Match an innermost pascal '(*...*)' style comment.\n" +
            "\\(\\*      # Comment opening literal delimiter.\n" +
            "[^(*]*      # {normal*} Zero or more non-'(', non-'*'.\n" +
            "(?:         # Begin {(special normal*)*} construct.\n" +
            "  (?!       # If we are not at the start of either...\n" +
            "    \\(\\*  # a nested comment\n" +
            "  | \\*\\)  # or the end of this comment,\n" +
            "  ) [(*]    # then ok to match a '(' or '*'.\n" +
            "  [^(*]*    # more {normal*}.\n" +
            ")*          # end {(special normal*)*} construct.\n" +
            "\\*\\)      # Comment closing literal delimiter.";
        String resultString = subjectString; // Stays unchanged if no comment is found.
        java.util.regex.Pattern p = java.util.regex.Pattern.compile(
                    regex,
                    java.util.regex.Pattern.COMMENTS);
        java.util.regex.Matcher m = p.matcher(subjectString);
        while (m.find())
        { // Iterate until there are no more "(* comments *)".
            resultString = m.replaceAll("");
            m = p.matcher(resultString);
        }
        System.out.println(resultString);
    }
}

Here is the short version of the regex (in native regex format):

\(\*[^(*]*(?:(?!\(\*|\*\))[(*][^(*]*)*\*\)

Note that this regex implements Jeffrey Friedl's efficient "unrolling the loop" technique and is quite fast. (See: Mastering Regular Expressions (3rd Edition).)
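
To illustrate what "unrolling" means here: a plain pattern with the same intent runs the lookahead before every single character, whereas the unrolled form consumes runs of ordinary characters in one step and only consults the lookahead at a '(' or '*'. The sketch below compares the two on a few inputs. The class name is hypothetical, and the plain pattern is written here only for comparison; it is not taken from the book.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class for illustration.
class UnrolledVersusPlain {
    // Unrolled shape: open normal* (special normal*)* close
    static final Pattern UNROLLED =
        Pattern.compile("\\(\\*[^(*]*(?:(?!\\(\\*|\\*\\))[(*][^(*]*)*\\*\\)");
    // Plain shape: lookahead before every character ([\s\S] = any character).
    static final Pattern PLAIN =
        Pattern.compile("\\(\\*(?:(?!\\(\\*|\\*\\))[\\s\\S])*\\*\\)");

    // Returns the first match of the pattern in the text, or null.
    static String first(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "out (* c2 (* c3 *) c2 *) out",
            "(* a * b ( c *)",
            "no comment here"
        };
        for (String s : samples) {
            System.out.println(first(UNROLLED, s) + " | " + first(PLAIN, s));
        }
    }
}
```

Both patterns should report the same match on each line; any speed difference depends on the input and the regex engine, so it is worth measuring rather than assuming.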

Caveats: This will certainly NOT work correctly if any comment delimiter (i.e. (* or *)) appears within a string literal, and thus it should NOT be used for general parsing. But a regex like this one is handy from time to time, for example for quick and dirty searching within an editor.
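
The string-literal caveat is easy to reproduce. In this small sketch (hypothetical class name, and a Pascal-like input line made up for the demonstration), the text between the quotes is data rather than a comment, yet the regex removes it:

```java
import java.util.regex.Pattern;

// Hypothetical demo of the string-literal caveat.
class StringLiteralCaveat {
    static final Pattern INNERMOST =
        Pattern.compile("\\(\\*[^(*]*(?:(?!\\(\\*|\\*\\))[(*][^(*]*)*\\*\\)");

    // Removes every innermost (* ... *) match in a single pass.
    static String strip(String source) {
        return INNERMOST.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String code = "writeln('(* not a comment *)');";
        System.out.println(strip(code)); // prints: writeln('');
    }
}
```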

See also my answer to a similar question for someone wanting to handle nested C-style comments.

ridgerunner