7

Can anyone help me with a regular expression to strip multiline comments and single-line comments from a file?

For example, the whole "/*...*/" has to be stripped off in each of these cases:

1.   /* comment */
2.   /* comment1 */  code   /* comment2 */   # both /*comment1*/ and /*comment2*/
                                             # have to be stripped off and the rest
                                             # should remain.
3.   /*.........
       .........
       .........
       ......... */

I would really appreciate any help with this. Thanks in advance.

brian d foy
User1611
  • 1
    As a rule of thumb I've found that when you're trying to programmatically manipulate a language like C, XML, SQL, etc. you should really be thinking of a parser, not regexes. I'd highly recommend learning about parser generators, yacc, javacc, etc. It's had huge payoffs for me as a software developer. – zimbu668 Aug 18 '09 at 14:06
  • @zimbu668 this is a situation where using a parser is very much overkill. There is no nesting or complex structure here, just simple comments – Shipof123 Nov 08 '20 at 00:27

6 Answers

17

From perlfaq6 "How do I use a regular expression to strip C style comments from a file?":


While this actually can be done, it's much harder than you'd think. For example, this one-liner

perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for certain kinds of C programs, in particular, those with what appear to be comments in quoted strings. For that, you'd need something like this, created by Jeffrey Friedl and later modified by Fred Curtis.

$/ = undef;
$_ = <>;
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
print;

This could, of course, be more legibly written with the /x modifier, adding whitespace and comments. Here it is expanded, courtesy of Fred Curtis.

s{
   /\*         ##  Start of /* ... */ comment
   [^*]*\*+    ##  Non-* followed by 1-or-more *'s
   (
     [^/*][^*]*\*+
   )*          ##  0-or-more things which don't start with /
               ##    but do end with '*'
   /           ##  End of /* ... */ comment

 |         ##     OR  various things which aren't comments:

   (
     "           ##  Start of " ... " string
     (
       \\.           ##  Escaped char
     |               ##    OR
       [^"\\]        ##  Non "\
     )*
     "           ##  End of " ... " string

   |         ##     OR

     '           ##  Start of ' ... ' string
     (
       \\.           ##  Escaped char
     |               ##    OR
       [^'\\]        ##  Non '\
     )*
     '           ##  End of ' ... ' string

   |         ##     OR

     .           ##  Any other char
     [^/"'\\]*   ##  Chars which don't start a comment, string or escape
   )
 }{defined $2 ? $2 : ""}gxse;
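To make the difference between the two approaches concrete, here is a minimal sketch that runs both substitutions on the same input. The sub names `strip_naive` and `strip_string_aware` and the sample line are made up for illustration; the regexes are the one-liner and the Friedl/Curtis version quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: the simple one-liner regex from above.
sub strip_naive {
    my ($src) = @_;
    $src =~ s{/\*.*?\*/}{}gs;
    return $src;
}

# Hypothetical helper: the string-aware regex from above.
# $2 is defined only when the match was NOT a comment, so it is put back.
sub strip_string_aware {
    my ($src) = @_;
    $src =~ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
    return $src;
}

# Made-up sample: a string literal that merely looks like a comment,
# followed by a real comment.
my $code = 'puts("/* not a comment */"); /* real comment */';

print "[", strip_naive($code), "]\n";          # string contents are lost
print "[", strip_string_aware($code), "]\n";   # string literal survives
```

The naive version empties the string literal, while the string-aware version keeps it and removes only the trailing comment.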

A slight modification also removes C++ comments, possibly spanning multiple lines using a continuation character:

 s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
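As a rough check of the continuation-character behaviour, the substitution above can be wrapped in a sub and fed a `//` comment that ends in a backslash. The name `strip_c_and_cpp_comments` and the sample input are assumptions for this sketch only.

```perl
use strict;
use warnings;

# Hypothetical wrapper; the substitution is the one quoted above.
# Here $3 holds the non-comment text, so it is what gets kept.
sub strip_c_and_cpp_comments {
    my ($src) = @_;
    $src =~ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
    return $src;
}

# Made-up input: a // comment continued onto the next line
# by a trailing backslash, followed by ordinary code.
my $code = "x = 1; // note \\\n still the comment\ny = 2;\n";

print strip_c_and_cpp_comments($code);
```

Note that the newline terminating the `//` comment is part of the match, so it is removed together with the comment and the following code moves up onto the same line.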
brian d foy
  • brian, that functionality could almost be added to Perl; it seems to be asked about so much. At least IMO. – Paul Nathan Aug 06 '09 at 21:43
  • ...and this is why we have tools like yacc, flex, bison, ANTLR, etc. This is something you need a full-blown parser for, not a regex. – Adam Rosenfield Aug 06 '09 at 21:49
  • 2
    @Paul: That functionality is already in Perl. Perl is a general purpose language. We don't want to add built-in features for every task that comes along. That's the job for modules. – brian d foy Aug 07 '09 at 02:15
11

As often in Perl, you can reach for the CPAN: Regexp::Common::Comment should help you. The one language I found that uses the comments you described is Nickle, but maybe PHP comments would be OK (// can also start a single-line comment).

Note that in any case, using regexps to strip out comments is dangerous; a full parser for the language is much less risky. A regexp-based stripper, for example, is likely to get confused by something like print "/*";.

mirod
6

This is a FAQ:

perldoc -q comment

Found in perlfaq6:

How do I use a regular expression to strip C style comments from a file?

While this actually can be done, it's much harder than you'd think. For example, this one-liner ...

Sinan Ünür
  • You can link to perlfaqs at http://faq.perl.org (always the latest version), or perldoc.perl.org. That way those sites get good google juice for the people who search for answers. :) – brian d foy Aug 07 '09 at 02:17
1

There is also a non-perl answer: use the program stripcmt:

StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the commandline.

hlovdal
0

Remove /* */ comments (including multi-line)

s/\/\*.*?\*\///gs

I post this because it is simple. However, I believe it will trip up on nested comments like

/* sdafsdfsdf /*sda asd*/ asdsdf */

But as they are fairly uncommon I prefer the simple regex.
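A quick sketch of what the simple regex does with that nested example (the variable names are only for illustration). The non-greedy `.*?` stops at the first `*/`, so the tail of the outer comment is left behind. C comments do not nest either, so a C compiler would also end the comment at that first `*/`.

```perl
use strict;
use warnings;

my $text = '/* sdafsdfsdf /*sda asd*/ asdsdf */';

# Copy the text, then apply the substitution from this answer to the copy.
(my $stripped = $text) =~ s/\/\*.*?\*\///gs;

print "[$stripped]\n";   # prints "[ asdsdf */]"
```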

gacrux
-3

Including tests:

use strict;
use warnings;
use Test::More qw(no_plan);
sub strip_comments {
  my $string=shift;
  $string =~ s#/\*.*?\*/##sg; #strip multiline C comments
  return $string;
}
is(strip_comments('a/* comment1 */  code   /* comment2 */b'),'a  code   b');
is(strip_comments('a/* comment1 /* comment2 */b'),'ab');
is(strip_comments("a/* comment1\n\ncomment */ code /* comment2 */b"),'a code b');
Alexandr Ciornii
  • 3
    Will mess up /* or */ appearing in a string. E.g. the string "This /* string" does not include a comment start. – Richard May 18 '09 at 14:05
  • 2
    As well as not handling comment characters in strings (or even multi-character character constants), it also does not handle backslash-newline splicing which permits the opening slash to be followed by backslash, newline and then asterisk, for example. Also does not handle C++ comments (which can also have backslash-newline splicing). And it doesn't handle trigraphs - the only relevant one is '??/' which means backslash. How much this matters depends on how bullet-proof your code needs to be. – Jonathan Leffler Aug 06 '09 at 21:50
  • mirod's answer is much better. – Chris Huang-Leaver Aug 18 '09 at 15:19
  • 2
    Replacing comments with the empty string is also wrong. It will change the semantics of code when tokens are spliced accidentally. The C Standard requires comments to be replaced by a single space character in translation phase 3. – Jens Mar 31 '12 at 09:49
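
The token-splicing point in the last comment can be illustrated with a small sketch. The sub name `strip_with` and the sample declaration are hypothetical; the regex is the simple one from this answer, with the replacement text passed in as a parameter.

```perl
use strict;
use warnings;

# Hypothetical helper: strip /* ... */ comments, substituting $replacement
# for each one (same simple regex as in the answer above).
sub strip_with {
    my ($src, $replacement) = @_;
    $src =~ s#/\*.*?\*/#$replacement#sg;
    return $src;
}

# Made-up input where a comment is the only thing separating two tokens.
my $code = 'int/* type */x;';

print strip_with($code, ''),  "\n";   # prints "intx;"  (tokens spliced)
print strip_with($code, ' '), "\n";   # prints "int x;" (tokens kept apart)
```

Replacing each comment with a single space keeps the neighbouring tokens separate, at the cost of some extra whitespace where a comment already had spaces around it.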