I have a simple directory structure:

rootdir\
    subdir1\
        file1.tif
    subdir2\
        file2.tif
    ...
    subdir13\
        file13.tif
    subdir14\
        file14.tif

If I call:

import os

print os.listdir('absolute\path\to\rootdir')

...then I get what you'd expect:

['subdir1', 'subdir2', ... 'subdir13', 'subdir14']

Same thing happens if I call os.listdir() on those sub-directories. For each one it returns the name of the file in that directory. No problems there.

And if I call:

import os

for dirpath, dirnames, filenames in os.walk('absolute\path\to\rootdir'):
    print filenames
    print dirnames

...then I get what you'd expect:

[]
['subdir1', 'subdir2', ... 'subdir13', 'subdir14']
['file1.tif']
[]
['file2.tif']
[]
...

But here's the strangeness. When I call:

import os

for dirpath, dirnames, filenames in os.walk('absolute\path\to\rootdir'):
    print filenames
    print dirnames
    print dirpath

...it never returns, ever. Even if I try:

print [each[0] for each in os.walk('absolute\path\to\roodir')]

...or anything of the sort. I can always print the second and third parts of the tuple returned by os.walk(), but the moment that I try to touch the first part the whole thing just stops.

Even stranger, this behavior only appears in scripts launched using the shell. The command line interpreter acts normally. I'm curious, what's going on here?

-----EDIT----- Actual code:

import os

ALLOWED_IMGFORMATS = [".jpg",".tif"]

def getCategorizedFiles(pathname):
    cats = [each[0] for each in os.walk(pathname) if not each[0] == pathname]
    ncats = len(cats)
    tree = [[] for i in range(ncats+1)]
    for cat in cats:
        catnum = int(os.path.basename(cat))
        for item in os.listdir(cat):
            if not item.endswith('.sift') and os.path.splitext(item)[-1].lower() in ALLOWED_IMGFORMATS:
                tree[catnum].append(os.path.join(cat, item))
    fileDict = {cat : tree[cat] for cat in range(1,ncats+1)}
    return fileDict

----EDIT 2---- Another development. As stated above, this problem occurs when the code is in a script launched from a shell, but not just any shell: it occurs with Console 2, but not with the Windows command prompt. It also occurs when the script is launched from Java (which is how I originally came across the problem), like so: http://www.programmersheaven.com/mb/python/415726/415726/invoking-python-script-from-java/?S=B20000
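
A side note on the placeholder path literals in the snippets above: in an ordinary Python string literal, `\t` and `\r` are escape sequences, so a path written with single backslashes may not contain the text it appears to. The sketch below uses a shortened stand-in path (an assumption for illustration, not the real directory) to show the difference. It is unlikely to be the cause of the hang, but it is worth ruling out.

```python
# A shortened stand-in for the placeholder path, written four ways.
plain = 'path\to\rootdir'        # "\t" becomes a tab, "\r" a carriage return
raw = r'path\to\rootdir'         # raw string: backslashes are kept literally
escaped = 'path\\to\\rootdir'    # doubled backslashes: same text as raw
forward = 'path/to/rootdir'      # forward slashes are also accepted on Windows

print('\t' in plain, '\r' in plain)   # control characters ended up in the string
print('\\' in plain)                  # no backslash survived in the plain literal
print(raw == escaped)
```

Any of the last three spellings avoids the problem; the raw-string form is usually the least noisy for Windows paths.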

ciph345
  • Not sure what the problem is. I just copy-pasted your code into a script and ran it (I am using `python 2.7`), and it ran just as expected – Anshul Aug 15 '13 at 19:17
  • Careful with those backslashes. Why not use forward slashes? They work on windows and won't produce weird escaping issues. – user2357112 Aug 15 '13 at 19:18
  • Maybe the problem is windows specific. Did you try using a debugger to see what code it's executing when it hangs? – arghbleargh Aug 15 '13 at 19:19
  • Can you show actual code that demonstrates the problem? You may have accidentally removed the bug while sanitizing your code; `absolute\path\to\rootdir` definitely isn't the real path. – user2357112 Aug 15 '13 at 19:21
  • @user2357112: Great point. The OP's path is actually looking for a directory in `absolute` whose name includes a tab and a CR in it, which is very unlikely to match anything… – abarnert Aug 15 '13 at 19:22
  • Are you saying that os.walk() itself never returns? That is, none of the prints are hit and the for loop doesn't end? Nothing gets printed? – tdelaney Aug 15 '13 at 19:33
  • I suspect the problem is a typo. `absolute\path\to\roodir` is missing a `t`; if the actual code has similar typos, the search won't find anything. – user2357112 Aug 15 '13 at 19:40
  • I am also using python 2.7. Thanks for the tip about forward slashes. I switched over to using them, but it hasn't changed anything. The actual path is "C:\\Program Files (x86)\\ImageJ\\sliceregistration\\data\\memoizations" (or with forward slashes), and the sub-directories are named "1" through "14", each containing a file "1.tif", "2.tif", etc. Also see the edit. – ciph345 Aug 15 '13 at 19:43
  • yes @tdelaney, that's exactly what happens. – ciph345 Aug 15 '13 at 19:46
  • that's very strange. os.walk() doesn't know that you are going to use dirpath later... it shouldn't make any difference. – tdelaney Aug 15 '13 at 19:57
  • Are you sure the loop never terminates? Put a print statement and a `sys.stdout.flush()` after the loop to make sure. – Thomas Aug 15 '13 at 20:01
  • Desperate experimentation reveals that I can print any combination of five dirpaths after calling os.walk(), but no more than five. I also observe that these five collective dirpaths contain more characters than all of my dirnames and filenames combined. And eliminating print statements that occur earlier in the program allows me to print more dirpaths. And printing to a log (print >> open('log.txt', 'a')) allows me to print everything. I think that my problem is not with os.walk() at all--which makes more sense--and instead has to do with the amount that I'm trying to direct to stdout/err. – ciph345 Aug 15 '13 at 20:46
  • Which is real strange, considering that I can't be printing more than 20 lines each of length < 100 chars over the course of the whole program. But if I'm somehow overloading console2's buffer and java's BufferedReader's maximum buffer size, then it explains the appearance of hanging at os.walk(), since stderr gets silenced too. – ciph345 Aug 15 '13 at 20:49
  • But let's say that I am indeed "printing too much" and past some critical number of printed characters, nothing more appears on stdout or stderr... then the program should still terminate no? So how come using 'ps' in Console2 shows that my program is actually stuck? And Java's proc.waitFor() never returns. – ciph345 Aug 15 '13 at 20:55
  • My initial guess was that a special character combination was creating some sort of escape sequence for the shell which kept it hanging. But I can't reproduce it on Windows 8, Console2 2.00.148, python 2.7.5. Tried different directory structures, including spaces and special characters. Maybe it's a bug in python or Console2. What version of the software are you using, and are you using a custom shell in Console2 (see Console2's settings)? Can you use a simple batch script to echo the same output (such as from your log file), and see if that suffers from the same problem? – catchmeifyoutry Aug 15 '13 at 22:01
  • Windows 7, Console 2.00.148, python 2.7.5. I'd made only superficial changes to the default settings (window size, some color changes, etc), but re-installing Console2 fixed the problem, so I must have done something somewhere... Unfortunately I still need this launched from a Java class, and that behavior persists. My workaround, which is fine I suppose, was to create my own scrolling text window in wx and redirect stdout/err to it. Works as well as I need it to. – ciph345 Aug 16 '13 at 16:32
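
The symptoms in the last few comments (output stops after a certain volume, the script stays alive, and Java's proc.waitFor() never returns) are consistent with a child process blocking on a full stdout/stderr pipe that the launcher never reads. The thread does not confirm that this was the cause, so treat the following as a sketch of that hypothesis only. It is written in Python rather than Java, and `CHILD_CODE` and `run_and_drain` are hypothetical names introduced for illustration.

```python
import subprocess
import sys

# Hypothetical child: writes far more text than a typical OS pipe buffer holds.
CHILD_CODE = "import sys\nfor i in range(20000):\n    sys.stdout.write('line %d\\n' % i)\n"

def run_and_drain(args):
    """Launch a child process and keep reading its output until it exits.

    communicate() reads stdout and stderr while the child runs, so the child
    never blocks on a full pipe. Calling wait() on a child whose pipes are
    never read can deadlock instead.
    """
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

returncode, out, err = run_and_drain([sys.executable, "-c", CHILD_CODE])
print(returncode, len(out.splitlines()))
```

If the same mechanism applies to the Java launcher, the analogous fix would be to read the process's output streams (for example via `Process.getInputStream()`, optionally with `ProcessBuilder.redirectErrorStream(true)`) before or while waiting on it, rather than calling `waitFor()` first.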

1 Answer

I've never really trusted os.walk(). Just write your own recursive stuff. It's not hard:

import os

def contents(folder, l):  # Recursive; returns a list of all files with full paths
    for item in os.listdir(folder):
        path = os.path.join(folder, item)
        if os.path.isfile(path):
            l.append(path)
        else:
            contents(path, l)  # anything that is not a file is treated as a folder
    return l

# folder is the directory you want to scan
all_files = contents(folder, [])

all_files will then be a list of all the files with full paths included. You can use os.path.split() if you like to make it a little easier to read.

Knowing how this works eliminates the uncertainty of using os.walk() in your code, which means you'll be able to identify if the problem in your code is really involved with os.walk().

If you need to put them in a dictionary (because dictionaries have aliasing benefits, too), you can also sort your files that way.
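
As a rough illustration of the dictionary idea, here is a minimal sketch that maps each immediate sub-directory name to the image files inside it, using only os.listdir(). The function name `files_by_category` and the throwaway temporary tree are assumptions made for the example; the keys are directory names as strings, not the integer categories used in the question.

```python
import os
import shutil
import tempfile

ALLOWED_IMGFORMATS = (".jpg", ".tif")  # same extensions as in the question

def files_by_category(rootdir):
    """Map each immediate sub-directory name to the image files inside it."""
    by_category = {}
    for name in sorted(os.listdir(rootdir)):
        subdir = os.path.join(rootdir, name)
        if not os.path.isdir(subdir):
            continue  # skip loose files sitting directly in rootdir
        by_category[name] = [
            os.path.join(subdir, item)
            for item in sorted(os.listdir(subdir))
            if os.path.splitext(item)[1].lower() in ALLOWED_IMGFORMATS
        ]
    return by_category

# Throwaway tree for demonstration only: two categories, one image each,
# plus a ".sift" file that should be filtered out.
root = tempfile.mkdtemp()
for name in ("1", "2"):
    os.mkdir(os.path.join(root, name))
    open(os.path.join(root, name, name + ".tif"), "w").close()
open(os.path.join(root, "1", "1.tif.sift"), "w").close()

result = files_by_category(root)
print(sorted(result))
shutil.rmtree(root)
```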

user2569332
  • +1 for self.walk (I use my own too). But you should get a -1 for "+". Use os.path.join or, even better, "%s/%s" % (folder, item). Put a simple time() around it and you will see the difference. directContents is much faster as ["%s/%s" % (folder, x) for x in os.listdir(folder)] – cox Sep 29 '13 at 18:44
  • Fair enough. Good suggestion, and thanks for correcting instead of unhelpfully down voting. – user2569332 Oct 02 '13 at 20:31
  • I would consider making this a generator object instead of a function that returns a list. In most cases it's unnecessary to create the entire list of files in one go, and it would be easy to do `list(contents)` if you do. – ali_m Oct 02 '13 at 20:41
  • generator is the right solution if you have a big path to scan, but if you need additional things (like sorting by human alg or excluding some of the files) an in-memory list or dict is faster – cox Oct 03 '13 at 14:57
  • sure, but then you just do `list()` etc., which will be just as fast as generating a list in the first place. with a generator you have the additional option of not creating the whole thing at once in memory. – ali_m Oct 04 '13 at 19:36
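
The generator variant suggested in the comments could look roughly like the sketch below. `iter_files` is a hypothetical name, and the temporary tree exists only to make the example runnable; it uses a plain for/yield loop so that it also works on Python 2.7, which is what the thread uses.

```python
import os
import shutil
import tempfile

def iter_files(folder):
    """Yield the full path of every file under folder, one at a time."""
    for item in sorted(os.listdir(folder)):
        path = os.path.join(folder, item)
        if os.path.isdir(path):
            # Recurse lazily; nothing is collected until the caller iterates.
            for sub_path in iter_files(path):
                yield sub_path
        elif os.path.isfile(path):
            yield path

# Throwaway tree for demonstration only.
root = tempfile.mkdtemp()
for name in ("1", "2"):
    os.mkdir(os.path.join(root, name))
    open(os.path.join(root, name, name + ".tif"), "w").close()

found = list(iter_files(root))  # materialize only when a full list is needed
print(found)
shutil.rmtree(root)
```

Callers that only need to loop once can iterate over `iter_files(root)` directly and never build the full list in memory.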