22

I've been scripting something that has to do with scanning directories and noticed a severe memory leak when calling os.path.isdir, so I've tried the following snippet:

def func():
    if not os.path.isdir('D:\Downloads'):
        return False
while True:
    func()

Within a few seconds, the Python process reached 100MB RAM.

I'm trying to figure out what's going on. It seems like the huge memory leak is in effect only when the path is indeed a valid directory path (meaning the 'return False' is not executed). Also, it is interesting to see what happens in related calls, like os.path.isfile.

Thoughts?

Edit: I think I'm onto something. Although isfile and isdir are implemented in the genericpath module, on Windows system - isdir is being imported from the builtin nt. So I had to download the 2.7.3 source (which I should've done long time ago...).

After a little bit of searching, I found out posix__isdir function in \Modules\posixmodule.c, which I assume is the 'isdir' function imported from nt.

This part of the function (and comment) caught my eye:

if (PyArg_ParseTuple(args, "U|:_isdir", &po)) {
        Py_UNICODE *wpath = PyUnicode_AS_UNICODE(po);

        attributes = GetFileAttributesW(wpath);
        if (attributes == INVALID_FILE_ATTRIBUTES)
            Py_RETURN_FALSE;
        goto check;
    }
    /* Drop the argument parsing error as narrow strings
       are also valid. */
    PyErr_Clear();

It seems that it all boils down to Unicode/ASCII handling bug.

I've just tried my snippet above with path argument in unicode (i.e. u'D:\Downloads') - no memory leak whatsoever. haha.

AAlon
  • 434
  • 3
  • 8
  • 2
    Hmm... this bears investigating further. I get the memory leak regardless of whether the directory exists. Python 2.7.3 on Win7 64-bit. – Chinmay Kanchi Sep 29 '12 at 00:21
  • 4
    Nicely done! Probably worth reporting this to the Python bug tracker at http://bugs.python.org – Chinmay Kanchi Sep 29 '12 at 01:32
  • Yeah! Nice job! You should definitely report this on the bug tracker! – David S Sep 29 '12 at 04:54
  • P.S. If you submit a patch with the bug report (I think in your case it is just `PyMem_Free(path)` after `GetFileAttributesA`), you can get your name in the `ACKS` file: http://hg.python.org/cpython/file/tip/Misc/ACKS :) – nneonneo Sep 29 '12 at 18:19

1 Answers1

8

The root cause is a failure to call PyMem_Free on the path variable in the non-Unicode path:

    if (!PyArg_ParseTuple(args, "et:_isdir",
                          Py_FileSystemDefaultEncoding, &path))
        return NULL;

    attributes = GetFileAttributesA(path);
    if (attributes == INVALID_FILE_ATTRIBUTES)
        Py_RETURN_FALSE;

check:
    if (attributes & FILE_ATTRIBUTE_DIRECTORY)
        Py_RETURN_TRUE;
    else
        Py_RETURN_FALSE;

As per the documentation on PyArg_ParseTuple:

  • et: Same as es...
  • es: PyArg_ParseTuple() will allocate a buffer of the needed size, copy the encoded data into this buffer and adjust *buffer to reference the newly allocated storage. The caller is responsible for calling PyMem_Free() to free the allocated buffer after use.

It's a bug in Python's standard library (fixed in Python 3 by using bytes objects directly); file a bug report at http://bugs.python.org.

nneonneo
  • 171,345
  • 36
  • 312
  • 383
  • Well, so it is a bug in the ASCII handling part of the function. Thanks, this finishes up the analysis! – AAlon Sep 29 '12 at 18:08
  • You're right, it is technically related to Unicode/ASCII. Adjusted my answer, thanks. Great job on the analysis. – nneonneo Sep 29 '12 at 18:10