I don't know how to handle '(', ')', and '*' characters that can appear inside a comment. Comments are multiline.
-
If this `(* comment *)` is a comment, then a regex cannot really handle `(* zz (* zz *)` or `(* zz *) zz *)` – Johan Sjöberg Mar 27 '11 at 17:28
-
It's not clear to me whether the question is simply about masking the parentheses and the asterisk, or about context recognition (strings in comments, comments in strings, nested comments and so on). – user unknown Mar 27 '11 at 18:06
-
@user unknown - I know how to escape parentheses and asterisks; this question is about how to handle comments that are nested or that contain `*`, `(` or `)`. – Andrey Moiseev Mar 29 '11 at 04:52
3 Answers
A simple pattern to handle that is:
\(\*(.*?)\*\)
Example: http://www.rubular.com/r/afqLCDssIx
You probably also want to set the single-line flag, (?s)\(\*(.*?)\*\)
Note that it doesn't handle cases like `(*` in strings, or other weird combinations. Your best bet is to use a parser, for example ANTLR, which already has a ready Pascal grammar (direct link).
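As a rough illustration of the simple pattern above, here is a minimal Java sketch. The class name `SimpleCommentStripper` and the method `strip` are hypothetical names chosen for this example; `Pattern.DOTALL` is the Java equivalent of the `(?s)` single-line flag. It assumes comments are not nested and that no `(*` appears inside a string literal.

```java
import java.util.regex.Pattern;

class SimpleCommentStripper {
    // Lazy match from "(*" to the nearest "*)". DOTALL lets '.' cross line breaks,
    // so multiline comments are matched as well.
    private static final Pattern COMMENT =
        Pattern.compile("\\(\\*(.*?)\\*\\)", Pattern.DOTALL);

    // Removes every non-nested "(* ... *)" comment from the given source text.
    static String strip(String source) {
        return COMMENT.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        String pascal = "x := 1; (* first\n   comment *) y := 2; (* second *)";
        System.out.println(strip(pascal));
    }
}
```

With nested input such as `(* a (* b *) c *)` this sketch would stop at the first `*)` and leave ` c *)` behind, which is the limitation discussed in the comments.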

-
+1 for finding an actual grammar that includes this. Just for completeness: even with a parser generator, dealing with potentially nested comments can be tricky. Also, you need to have the grammar for the rest of the (non-comment) content. E.g. if I have no knowledge of the rest of the syntax, then just looking for an opening `(*` token may cause me to find something inside a string literal. – phooji Mar 27 '11 at 17:42
-
@phooji - Thanks! A parser should handle even nested constructs - it does so easily enough for the whole code structure. That said, I doubt `(* zz *) zz *)` is valid - very few languages actually allow nested comments. `/* 1 /* 2 */ 1 */` wouldn't pass in any C++ compiler `:)` – Kobi Mar 27 '11 at 17:48
-
I want to underline that you have to mask the backslashes of `\(\*(.*?)\*\)` in Java, which looks like this in Java code: `String r = "\\(\\*(.*?)\\*\\)";` – user unknown Mar 27 '11 at 18:04
-
@Kobi: Right. I guess I'm thinking of a more old school Lex/Yacc setup where, to parse comments, you put the lexer in a special 'comment' mode. – phooji Mar 27 '11 at 18:19
-
Actually, I took part in a programming competition. I spent a lot of time on the first tasks, but there was another one: "Delete all {}, // and (* *) comments from input Pascal code". I had only 4 minutes to do it, so I decided to use Java's regular expressions. I managed to remove the // and {} comments, but my third regexp didn't work well. So, as I understand now, that task should be done without using regexps. Still, with your regexp my program might have passed most of the tests, so thanks for the help! – Andrey Moiseev Mar 29 '11 at 05:07
-
This regex does not work with nested comments. Not innermost comments. Not outermost comments. – ridgerunner Mar 29 '11 at 15:04
-
@ridgerunner - The question doesn't ask for nested comments. Nested comments are not common, and not valid in most languages. It seems in Pascal it depends on the version: http://stackoverflow.com/questions/3842443/comments-in-pascal – Kobi Mar 29 '11 at 15:17
If you want to find the innermost nested comment, for example with /* */ delimiters in
/*
/*
comment1
/*
comment2
*/
*/
*/
regular expression will be
\/\*[^/*]*(?:(?!\/\*|\*\/)[/*][^/*]*)*\*\/
this will find
/*
comment2
*/
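A minimal Java sketch of how this pattern might be applied follows. The class and method names (`InnermostCStyle`, `findInnermost`, `stripNested`) are hypothetical, and the repeated-removal loop is an assumption about how one would use the pattern to strip nested comments; the answer itself only shows the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostCStyle {
    // The pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern INNERMOST = Pattern.compile(
        "\\/\\*[^/*]*(?:(?!\\/\\*|\\*\\/)[/*][^/*]*)*\\*\\/");

    // Returns the first innermost comment, or null if there is none.
    static String findInnermost(String text) {
        Matcher m = INNERMOST.matcher(text);
        return m.find() ? m.group() : null;
    }

    // Removes innermost comments repeatedly until the text stops changing,
    // which peels nested comments from the inside out.
    static String stripNested(String text) {
        String previous;
        do {
            previous = text;
            text = INNERMOST.matcher(text).replaceAll("");
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String nested = "/*\n/*\ncomment1\n/*\ncomment2\n*/\n*/\n*/";
        System.out.println(findInnermost(nested));
        System.out.println(stripNested("a /* b /* c */ d */ e"));
    }
}
```

Like the pattern itself, this sketch does not know about string literals, so a `/*` inside a string would still be treated as a comment opener.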

Regarding the handling of nested comments: although it is true that you cannot use a Java regex to match an outermost comment, you can craft one which will match an innermost comment (with some notable exceptions - see caveats below). (Note that the `\(\*(.*?)\*\)` expression will NOT work in this case, as it does not correctly match an innermost comment.) The following is a tested Java program which uses a (heavily commented) regex which matches only innermost comments, and applies it iteratively to correctly strip nested comments:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TEST {
    public static void main(String[] args) {
        String subjectString = "out1 (* c1 *) out2 (* c2 (* c3 *) c2 *) out3";
        String regex = "" +
            "# Match an innermost pascal '(*...*)' style comment.\n" +
            "\\(\\*      # Comment opening literal delimiter.\n" +
            "[^(*]*      # {normal*} Zero or more non-'(', non-'*'.\n" +
            "(?:         # Begin {(special normal*)*} construct.\n" +
            "  (?!       # If we are not at the start of either...\n" +
            "    \\(\\*  # a nested comment\n" +
            "  | \\*\\)  # or the end of this comment,\n" +
            "  ) [(*]    # then ok to match a '(' or '*'.\n" +
            "  [^(*]*    # more {normal*}.\n" +
            ")*          # end {(special normal*)*} construct.\n" +
            "\\*\\)      # Comment closing literal delimiter.";
        // Start from the input so that text without comments is printed unchanged
        // (previously this was null, which printed "null" when nothing matched).
        String resultString = subjectString;
        // COMMENTS mode ignores whitespace and '#' comments inside the pattern.
        Pattern p = Pattern.compile(regex, Pattern.COMMENTS);
        Matcher m = p.matcher(resultString);
        while (m.find())
        { // Iterate until there are no more "(* comments *)".
            resultString = m.replaceAll("");
            m = p.matcher(resultString);
        }
        System.out.println(resultString);
    }
}
Here is the short version of the regex (in native regex format):
\(\*[^(*]*(?:(?!\(\*|\*\))[(*][^(*]*)*\*\)
Note that this regex implements Jeffrey Friedl's efficient "unrolling-the-loop" technique and is quite fast. (See: Mastering Regular Expressions (3rd Edition)).
Caveats: This will certainly NOT work correctly if any comment delimiter (i.e. `(*` or `*)`) appears within a string literal, and thus should NOT be used for general parsing. But a regex like this one is handy to use from time to time - for quick and dirty searching within an editor, for example.
See also my answer to a similar question for someone wanting to handle nested C-style comments.
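For comparison, the string-literal caveat can be avoided with a small hand-written scanner instead of a regex. The following is only a sketch under stated assumptions, not part of the tested program above: the class name `PascalCommentScanner` is hypothetical, string literals are assumed to be delimited by single quotes (a doubled `''` inside a string simply toggles the flag twice), and `(* *)` comments are assumed to nest, which depends on the Pascal dialect.

```java
class PascalCommentScanner {
    // Strips "(* ... *)" comments, tracking nesting depth and skipping
    // delimiters that occur inside single-quoted string literals.
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        int depth = 0;            // current comment nesting depth (assumed nestable)
        boolean inString = false; // true while inside a '...' literal outside comments
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean hasNext = i + 1 < src.length();
            if (depth == 0 && c == '\'') {
                inString = !inString;   // quotes only count outside comments
                out.append(c);
                i++;
            } else if (!inString && hasNext && c == '(' && src.charAt(i + 1) == '*') {
                depth++;                // comment opener, possibly nested
                i += 2;
            } else if (depth > 0 && hasNext && c == '*' && src.charAt(i + 1) == ')') {
                depth--;                // comment closer
                i += 2;
            } else {
                if (depth == 0) out.append(c); // keep only text outside comments
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("out1 (* c1 *) out2 (* c2 (* c3 *) c2 *) out3"));
        System.out.println(strip("s := '(* not a comment *)'; (* real *)"));
    }
}
```

The scanner makes a single pass over the input, so it does not need the repeated replace loop, but it still knows nothing about `{}` or `//` comments or the rest of the grammar.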
