
Is there a Python script or tool available which can remove comments and docstrings from Python source?

It should take care of cases like:

"""
aas
"""
def f():
    m = {
        u'x':
            u'y'
        } # faake docstring ;)
    if 1:
        'string' >> m
    if 2:
        'string' , m
    if 3:
        'string' > m

In the end I came up with a simple script that uses the tokenize module and removes comment tokens. It seems to work pretty well, except that I am not able to remove docstrings in all cases. See if you can improve it to remove docstrings as well.

import cStringIO
import tokenize

def remove_comments(src):
    """
    This reads tokens using tokenize.generate_tokens and recombines them
    using tokenize.untokenize, skipping comment/docstring tokens in between.
    """
    f = cStringIO.StringIO(src)
    class SkipException(Exception): pass
    processed_tokens = []
    last_token = None
    # go thru all the tokens and try to skip comments and docstrings
    for tok in tokenize.generate_tokens(f.readline):
        t_type, t_string, t_srow_scol, t_erow_ecol, t_line = tok

        try:
            if t_type == tokenize.COMMENT:
                raise SkipException()

            elif t_type == tokenize.STRING:

                if last_token is None or last_token[0] in [tokenize.INDENT]:
                    # FIXME: this may remove valid strings too?
                    #raise SkipException()
                    pass

        except SkipException:
            pass
        else:
            processed_tokens.append(tok)

        last_token = tok

    return tokenize.untokenize(processed_tokens)
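
For reference, a minimal way to run it over a file (a sketch; Python 2 because of cStringIO, and the command-line usage is just an example):

if __name__ == '__main__':
    import sys
    # Read the source file given on the command line and print the stripped result
    with open(sys.argv[1]) as f:
        print(remove_comments(f.read()))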

Also, I would like to test it on a very large collection of scripts with good unit test coverage. Can you suggest such an open source project?

Anurag Uniyal
  • @mavnn :), don't ask, we need to safeguard our code against prying eyes LOL – Anurag Uniyal Nov 20 '09 at 09:42
  • -1: The -OO option deletes the comments from the bytecode. Why mess with anything else? It makes no sense to obfuscate the code by removing docstrings (which may contain useful unit tests). – S.Lott Nov 20 '09 at 11:04
  • @S.Lott -OO freezes compiled code to a distinct Python version. I agree with you that it's not a commonly useful task, but it is needed in some rare cases. Also, it's a good toy task for me, so +1. – Denis Otkidach Nov 20 '09 at 11:16
  • One use case might be to count lines of code. `cloc` includes docstrings - I'd prefer if it didn't. – Jonathan Hartley Sep 23 '15 at 19:12

9 Answers


I'm the author of the "mygod, he has written a python interpreter using regex..." code (i.e. pyminifier) mentioned at that link =).
I just wanted to chime in and say that I've improved the code quite a bit using the tokenize module (which I discovered thanks to this question =) ).

You'll be happy to note that the code no longer relies so much on regular expressions and uses tokenize to great effect. Anyway, here's the remove_comments_and_docstrings() function from pyminifier
(note: it works properly with the edge cases that the previously posted code breaks on):

import cStringIO, tokenize
def remove_comments_and_docstrings(source):
    """
    Returns 'source' minus comments and docstrings.
    """
    io_obj = cStringIO.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        # The following two conditionals preserve indentation.
        # This is necessary because we're not using tokenize.untokenize()
        # (because it spits out code with copious amounts of oddly-placed
        # whitespace).
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        # Remove comments:
        if token_type == tokenize.COMMENT:
            pass
        # This series of conditionals removes docstrings:
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                # This is likely a docstring; double-check we're not inside an operator:
                if prev_toktype != tokenize.NEWLINE:
                    # Note regarding NEWLINE vs NL: The tokenize module
                    # differentiates between newlines that end a logical
                    # line (statement) and newlines inside of operators
                    # such as parens, brackets, and curly braces.  Newlines
                    # that end a statement are NEWLINE and all the others
                    # (blank lines and newlines inside operators) are NL.
                    # Catch whole-module docstrings:
                    if start_col > 0:
                        # Unlabelled indentation means we're inside an operator
                        out += token_string
                    # Note regarding the INDENT token: The tokenize module does
                    # not label indentation inside of an operator (parens,
                    # brackets, and curly braces) as actual indentation.
                    # For example:
                    # def foo():
                    #     "The spaces before this docstring are tokenize.INDENT"
                    #     test = [
                    #         "The spaces before this string do not get a token"
                    #     ]
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    return out
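
For example, to run it over a whole file (a sketch; Python 2 because of cStringIO, and 'my_module.py' is just a placeholder name):

with open('my_module.py') as f:
    stripped = remove_comments_and_docstrings(f.read())
print(stripped)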
Dan McDougall

This does the job:

""" Strip comments and docstrings from a file.
"""

import sys, token, tokenize

def do_file(fname):
    """ Run on just one file.

    """
    source = open(fname)
    mod = open(fname + ",strip", "w")

    prev_toktype = token.INDENT
    first_line = None
    last_lineno = -1
    last_col = 0

    tokgen = tokenize.generate_tokens(source.readline)
    for toktype, ttext, (slineno, scol), (elineno, ecol), ltext in tokgen:
        if 0:   # Change to if 1 to see the tokens fly by.
            print("%10s %-14s %-20r %r" % (
                tokenize.tok_name.get(toktype, toktype),
                "%d.%d-%d.%d" % (slineno, scol, elineno, ecol),
                ttext, ltext
                ))
        if slineno > last_lineno:
            last_col = 0
        if scol > last_col:
            mod.write(" " * (scol - last_col))
        if toktype == token.STRING and prev_toktype == token.INDENT:
            # Docstring
            mod.write("#--")
        elif toktype == tokenize.COMMENT:
            # Comment
            mod.write("##\n")
        else:
            mod.write(ttext)
        prev_toktype = toktype
        last_col = ecol
        last_lineno = elineno

if __name__ == '__main__':
    do_file(sys.argv[1])

I'm leaving stub comments in the place of docstrings and comments since it simplifies the code. If you remove them completely, you also have to get rid of indentation before them.

Ned Batchelder
  • Although it looks OK for most cases in practice, it's not generally correct. Just imagine statements like `'string' >> obj`, which for example stores the string in `obj` (if it has a corresponding `__rrshift__()` method). – Denis Otkidach Nov 20 '09 at 10:42
  • This has other problems as well. For example, if a function *only* has a docstring, the result is syntactically invalid. Also, there seem to be some confusing issues with tab handling (not that anyone should use tabs). It might be interesting to see a proper version of this idea based on an AST instead of a token stream. – Jean-Paul Calderone Feb 07 '13 at 20:47

Here is a modification of Dan's solution that makes it run on Python 3, also removes empty lines, and is ready to use:

import io, tokenize
def remove_comments_and_docstrings(source):
    io_obj = io.StringIO(source)
    out = ""
    prev_toktype = tokenize.INDENT
    last_lineno = -1
    last_col = 0
    for tok in tokenize.generate_tokens(io_obj.readline):
        token_type = tok[0]
        token_string = tok[1]
        start_line, start_col = tok[2]
        end_line, end_col = tok[3]
        ltext = tok[4]
        if start_line > last_lineno:
            last_col = 0
        if start_col > last_col:
            out += (" " * (start_col - last_col))
        if token_type == tokenize.COMMENT:
            pass
        elif token_type == tokenize.STRING:
            if prev_toktype != tokenize.INDENT:
                if prev_toktype != tokenize.NEWLINE:
                    if start_col > 0:
                        out += token_string
        else:
            out += token_string
        prev_toktype = token_type
        last_col = end_col
        last_lineno = end_line
    out = '\n'.join(l for l in out.splitlines() if l.strip())
    return out
with open('test.py', 'r') as f:
    print(remove_comments_and_docstrings(f.read()))
Basj

I found an easier way to do this with the ast and astunparse modules (available from pip). It converts the code text into a syntax tree, and the astunparse module then prints the code back out again without the comments. I had to strip out the docstrings with a simple match, but it seems to work. I've been looking through the output, and so far the only downside of this method is that it strips all newlines from your code.

import ast, astunparse

with open('my_module.py') as f:
    lines = astunparse.unparse(ast.parse(f.read())).split('\n')
    for line in lines:
        if line.lstrip()[:1] not in ("'", '"'):
            print(line)
SurpriseDog
  • This is the only correct way, imo. That `lstrip() ... in` command should be replaced by looking at the ast and excluding docstring nodes, though (see the sketch after these comments). It's relying on unparse to behave a certain way and never use multiline statements, etc. – Erik Aronesty Dec 09 '20 at 19:54
  • @ErikAronesty In the parsed AST, all docstrings are just string constants; it's easier to unparse and then filter out the lines that start like strings. – Demetry Pascal May 23 '23 at 11:06
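
A sketch of what Erik suggests above: drop the docstring nodes from the AST itself and then unparse, so no string heuristics are needed on the output. This uses only the standard library (ast.unparse requires Python 3.9+); the helper name is just illustrative:

import ast

def strip_docstrings(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Only modules, classes and (async) functions can carry a docstring
        if not isinstance(node, (ast.Module, ast.ClassDef,
                                 ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body = node.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            body.pop(0)              # drop the docstring statement
            if not body:             # keep the block syntactically valid
                body.append(ast.Pass())
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)         # comments are already dropped by ast.parse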

Try testing each chunk of tokens ending with NEWLINE. I believe the correct pattern for a docstring (including cases where it serves as a comment but isn't assigned to __doc__) is the following (assuming the match is performed from the start of the file or after a NEWLINE):

( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE

This should handle all the tricky cases: string concatenation, line continuation, module/class/function docstrings, and a comment on the same line after the string. Note that there is a difference between NL and NEWLINE tokens, so we don't need to worry about a string that sits alone on a line inside an expression.
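
A rough sketch of this idea (illustrative, not battle-tested): buffer the tokens of each logical line and, when the line matches the pattern above, replace the STRING tokens with an empty string literal, which keeps indentation and otherwise-empty bodies valid:

import io
import tokenize

def is_docstring_line(line_tokens):
    # Match ( DEDENT+ | INDENT? ) STRING+ COMMENT? NEWLINE, ignoring NL tokens
    types = [t[0] for t in line_tokens if t[0] != tokenize.NL]
    while types and types[0] == tokenize.DEDENT:
        types.pop(0)
    if types and types[0] == tokenize.INDENT:
        types.pop(0)
    if not types or types[-1] != tokenize.NEWLINE:
        return False
    types.pop()
    if types and types[-1] == tokenize.COMMENT:
        types.pop()
    return bool(types) and all(t == tokenize.STRING for t in types)

def remove_docstrings(source):
    out, line = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        line.append(tok)
        if tok[0] == tokenize.NEWLINE:
            if is_docstring_line(line):
                # Substitute '' for the docstring text; keeping the original
                # start/end positions means untokenize stays consistent.
                line = [(tokenize.STRING, "''") + tuple(t[2:]) if t[0] == tokenize.STRING
                        else t for t in line]
            out.extend(line)
            line = []
    out.extend(line)                 # trailing DEDENT/ENDMARKER tokens
    return tokenize.untokenize(out)

A comment line directly above a docstring ends up in the same buffered chunk and makes the matcher bail out, so the sketch is deliberately conservative in that case.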

Denis Otkidach

I've just used the code given by Dan McDougall, and I found two problems.

  1. There were too many empty new lines, so I decided to remove a line every time there are two consecutive new lines.
  2. When the Python code was processed, all spaces were missing (except indentation), so things such as "import Anything" changed into "importAnything", which caused problems. I added spaces after and before the reserved Python words that needed them. I hope I didn't make any mistake there.

I think I have fixed both things by adding (before the return) a few more lines:

# Removing unneeded newlines from string
buffered_content = cStringIO.StringIO(content) # Takes the string generated by Dan McDougall's code as input
content_without_newlines = ""
previous_token_type = tokenize.NEWLINE
for tokens in tokenize.generate_tokens(buffered_content.readline):
    token_type = tokens[0]
    token_string = tokens[1]
    if previous_token_type == tokenize.NL and token_type == tokenize.NL:
        pass
    else:
        # add necessary spaces
        prev_space = ''
        next_space = ''
        if token_string in ['and', 'as', 'or', 'in', 'is']:
            prev_space = ' '
        if token_string in ['and', 'del', 'from', 'not', 'while', 'as', 'elif', 'global', 'or', 'with', 'assert', 'if', 'yield', 'except', 'import', 'print', 'class', 'exec', 'in', 'raise', 'is', 'return', 'def', 'for', 'lambda']:
            next_space = ' '
        content_without_newlines += prev_space + token_string + next_space # This will be our new output!
    previous_token_type = token_type
p4r4noj4

I was trying to create a program that would count all the lines in a Python file, ignoring blank lines, lines with comments, and docstrings. Here is my solution:

from collections import Counter

with open(file_path, 'r', encoding='utf-8') as pyt_file:
  count = 0
  docstring = False

  for i_line in pyt_file.readlines():

    cur_line = i_line.rstrip().replace(' ', '')

    if cur_line.startswith('"""') and not docstring:
      marks_counter = Counter(cur_line)
      if marks_counter['"'] == 6:
        count -= 1
      else:
        docstring = True

    elif cur_line.startswith('"""') and docstring:
      count -= 1
      docstring = False

    if len(cur_line) > 0 and not cur_line.startswith('#') and not docstring:
      count += 1

My problem was detecting the docstrings (both one-line and multi-line ones), so I suppose if you want to delete them you can try to use the same flag solution.
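
A sketch of that flag approach used to drop lines instead of counting them (illustrative only; it assumes """-style docstrings and ignores edge cases such as '''-quoted or inline docstrings):

kept_lines = []
docstring = False
with open(file_path, 'r', encoding='utf-8') as pyt_file:
    for i_line in pyt_file:
        cur_line = i_line.rstrip().replace(' ', '')
        if cur_line.startswith('"""'):
            if not docstring and cur_line.count('"') < 6:
                docstring = True      # opening line of a multi-line docstring
            else:
                docstring = False     # a one-liner, or the closing line
            continue                  # never keep the delimiter lines themselves
        if docstring or not cur_line or cur_line.startswith('#'):
            continue
        kept_lines.append(i_line.rstrip('\n'))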

P.S. I understand that it is an old question, but when I was dealing with my problem I couldn't find anything simple and effective.

dani_p

I recommend using this code (based on @SurpriseDog's answer):

from typing import Any

import ast
from ast import Constant
import astunparse  # pip install astunparse

class NewLineProcessor(ast.NodeTransformer):
    """class for keeping '\n' chars inside python strings during ast unparse"""
    def visit_Constant(self, node: Constant) -> Any:
        if isinstance(node.value, str):
            node.value = node.value.replace('\n', '\\n')
        return node

with open(file_from) as f:
    tree = ast.parse(f.read())
    tree = NewLineProcessor().visit(tree)
    lines = astunparse.unparse(tree).split('\n')
    print(lines)
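
To actually drop the docstrings, the resulting lines can then be filtered the same way as in @SurpriseDog's answer (a sketch; file_from above and file_to here are placeholder paths):

with open(file_to, 'w') as out:
    # Keep only lines that do not start like a bare string literal
    out.write('\n'.join(line for line in lines
                        if line.lstrip()[:1] not in ("'", '"')))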

Demetry Pascal

Although the question was asked more than a decade ago, I wrote this to address the same issue - I wanted them removed for compilation.

import ast
import astor
import re

def remove_docs_and_comments(file):
    with open(file,"r") as f:
        code = f.read() 
    parsed = ast.parse(code)
    for node in ast.walk(parsed):
        # A bare string expression statement (a docstring, in practice)
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Constant) \
                and isinstance(node.value.value, str):
            # set its value to an empty string
            node.value = ast.Constant(value='')
    formatted_code = astor.to_source(parsed)  
    pattern = r'^.*"""""".*$' # remove empty """"""
    formatted_code = re.sub(pattern, '', formatted_code, flags=re.MULTILINE) 
    return formatted_code 

remove_docs_and_comments("your_script.py")

It will return condensed code without docstrings and comments.

netrox