
I've written a script to crawl directories on my system and record file metadata. I've used os.walk to do this.

It has worked for the most part, but it returns a different list of files when run on different machines.

Right now I'm testing on my Dropbox folder. On my MacBook Pro (Lion) it crawls the folder and returns the correct number of files. On my iMac (Mountain Lion) it does not, typically skipping 1-3 files per run. Additional crawls will pick up a straggler, but it usually continues to ignore a few files in the directory.

Here's a short snippet of the code:

import os

directory = '/Users/user/Dropbox/'
for dirname, dirnames, filenames in os.walk(directory):
    for subdirname in dirnames:
        for filename in filenames:
            if os.path.isfile(filename):
                pass  # collect file info using os.path and os.stat

I obviously want to ignore directories. Is there a better way to do this? Preferably something that is OS-agnostic.

frankV
  • You do *not* need to loop over the dirnames if all you are doing is collecting information on the filenames. – Martijn Pieters Dec 04 '12 at 17:29
  • But what if I want to store the full path as part of the meta data? – frankV Dec 04 '12 at 17:29
  • The `dirnames` are subdirectories of the current path and siblings of the `filenames`. For full paths, use `dirname`. It's just that directories in `dirname` and filenames in `dirname` are listed separately. – Martijn Pieters Dec 04 '12 at 17:30
  • I'm using this `fullPath = os.path.join(dirname, filename)` – frankV Dec 04 '12 at 17:30
  • Exactly, so you do not need to loop over `dirnames`. You are not using the values of `dirnames`. – Martijn Pieters Dec 04 '12 at 17:31
  • So I can rewrite the second line to `for dirname, filenames in os.walk(directory):` ? – frankV Dec 04 '12 at 17:32
  • No, you can't. But you can remove the `for subdirname in dirnames:` loop altogether. – Martijn Pieters Dec 04 '12 at 17:33
  • Also, you should not need to use `os.path.isfile`, because *all entries in `filenames`* are files, not directories. – Martijn Pieters Dec 04 '12 at 17:34
  • I've tried that. Without it, os.path was called on directories and terminated the program. – frankV Dec 04 '12 at 17:35
  • Do you have access to the files you are losing? – f p Dec 04 '12 at 17:37
  • I took out that line and the script no longer searches filenames in the subdirectories. – frankV Dec 04 '12 at 17:37
  • @fp yes I do have access to them. – frankV Dec 04 '12 at 17:38
  • @frankV: Then the OS is providing inconsistent information on your files. Either Dropbox is changing the files constantly, your filesystem is corrupt, or you have an OS-level corruption. It is most certainly not a Python problem. – Martijn Pieters Dec 04 '12 at 17:40
  • @frankV: The files in subdirectories will be searched in the *next* iteration of the loop. The `dirnames` list is *mostly* supplied so that you can alter ordering and/or add or remove directories to be searched next in a breadth-first search (see the sketch below). – Martijn Pieters Dec 04 '12 at 17:40
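
To illustrate the comments above, here is a minimal sketch (the `skip` set of directory names is a made-up example, not something from the original script): `filenames` already contains only the files of the current `dirname`, and pruning `dirnames` in place controls which subdirectories os.walk descends into next:

import os

directory = '/Users/user/Dropbox/'
skip = {'.dropbox.cache'}  # hypothetical directory names to prune

for dirname, dirnames, filenames in os.walk(directory):
    # Editing dirnames in place tells os.walk which subdirectories
    # to descend into on later iterations (top-down walk, the default).
    dirnames[:] = [d for d in dirnames if d not in skip]
    for filename in filenames:
        # filenames holds only the files in dirname, so no isfile() check is needed
        print(os.path.join(dirname, filename))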

1 Answer


The trick is as @MartijnPieters suggested: it is unnecessary to loop over the sub-directories as well, because they are picked up in the next iteration of the walk. This was the cause of the discrepancies between my two machines.

Also, it is important to note that OS X has a very odd way of counting the files in a given directory. You can see this by running df on a given directory, then doing 'Get Info' on it, and comparing the results.
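
As a rough cross-check (a sketch only, reusing the same Dropbox path), you can count the files os.walk actually sees and compare that number with the item count from 'Get Info':

import os

directory = '/Users/user/Dropbox/'
# Total number of files os.walk sees; compare with the item count from 'Get Info'.
total = sum(len(filenames) for _, _, filenames in os.walk(directory))
print(total)

The crawl itself, without the extra loop over dirnames, then looks like this: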

import os

directory = '/Users/user/Dropbox/'
for dirname, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        path = os.path.join(dirname, filename)  # full path, not just the bare name
        if os.path.isfile(path):
            pass  # collect file info using os.path and os.stat
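
For completeness, a sketch of what that placeholder comment might expand into (the particular metadata fields and the `records` list are illustrative, not part of the original script); the try/except guards against files that Dropbox moves or removes between the directory listing and the stat call:

import os

directory = '/Users/user/Dropbox/'
records = []

for dirname, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        path = os.path.join(dirname, filename)
        try:
            st = os.stat(path)
        except OSError:
            continue  # file vanished or is unreadable; skip it
        records.append({
            'path': path,
            'size': st.st_size,       # size in bytes
            'modified': st.st_mtime,  # last-modified timestamp
        })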
frankV