3

I'm looking for a way to list the subdirectories contained within the current working directory; however, I haven't been able to find a way that doesn't iterate over all of the files as well.

Essentially, if I have a folder with a large number of files and 2 folders, I want a method that can quickly return a list containing the names of the 2 folders without having to scan all of the files, too.

Is there a way to do this in Python?

Edit: I should clarify that my question concerns the performance of retrieving the directories. I already know of several ways to get the directories, but they're all slowed down if the working directory has a large number of files in it as well.

Novark
  • Roughly how many entries are in the directory? – FMc Aug 01 '15 at 02:46
  • 2
Considering that directories are nothing more than tables having records of the form `|name|inode|very-little-else|`, I doubt there is a way to do what you're asking for. Even if there was a function to return just the directories, it would still have to iterate through all the rows trying to figure out which entry corresponds to a subdirectory and which doesn't. What I'm trying to say is that directory entries and regular files are dumped together and not stored separately. – rohithpr Aug 01 '15 at 02:55
  • @FMc: There are an arbitrary number of files - I'm traversing a directory that I don't know anything about in advance, so I have to take into consideration a directory that could contain any number of files. – Novark Aug 01 '15 at 02:55
  • @NightShadeQueen - OP is looking for a way to avoid iterating over all the contents of the directory. The solution to the linked question fetches all the entries and filters out the regular files. – rohithpr Aug 01 '15 at 02:56
  • 1
    @Novark You say you notice a performance problem: at what number of entries does it occur? It makes little sense to build a program to handle an arbitrary number of entries: a million? a billion? As praroh1 notes, at the end of the day, anything you use will traverse the entries, whether explicitly in Python or behind the scenes. – FMc Aug 01 '15 at 02:58
  • @praroh1: My knowledge of the details surrounding how the OS stores files and folders is a bit rusty, but I thought that there was a way to tell the difference between a file and a directory. I may be wrong about this though... – Novark Aug 01 '15 at 02:59
  • 1
    @Novark - There is a way to differentiate between the two but you still have to examine all the entries. A couple of measures can be taken to improve performance though. Will you be accessing the same folders repeatedly or is it a one time thing? Does the folder belong to your application? – rohithpr Aug 01 '15 at 03:02
  • @FMc: Ok, let's say 50,000 files. I just tested this, and it took 1.104s on my machine for os.walk() to return the 2 directories that were placed in the same working directory. In 99.9% of cases, there will never be this many files in a single directory on someone's filesystem, however, I still need to consider that they might be doing something crazy, which will slow down my os.walk() when it hits that folder. – Novark Aug 01 '15 at 03:03
  • @Novark - sounds like premature optimization! – rohithpr Aug 01 '15 at 03:04
@Novark OK, that's in the extreme (but not completely insane) range :). In my experience, with that number of entries things do indeed slow down. One idea is to compare Python `os.walk` against a system call to `find -type d`; it might be a wash, or even slower. You could also run `find -type d` directly on the command line to get a lower bound (a rough sketch of this comparison follows this comment list). When you are benchmarking, bear in mind that the 2nd attempt to read a directory will tend to be much faster, because the OS does all sorts of caching. – FMc Aug 01 '15 at 03:08
  • @praroh1: Essentially I need to traverse a folder hierarchy that I don't know anything about in advance. I start at the node that the script is located in, and recursively walk across all child nodes. I only need to explore the hierarchy at this stage - I don't care about what files are present. I just want to be able to build up a structure that maps out all child nodes from the starting node. Also, regarding premature optimization...I actually hit the slowdown on my machine when it hit a folder with a bunch of debug files that I forgot to clean out which is how I found the issue :-) – Novark Aug 01 '15 at 03:08
  • @Novark - Is it an interactive application? Like, are you worried that the UX will be ruined by the long wait times? If yes, you could do some sort of pre-fetching at the time of starting your application and store the results in a database. – rohithpr Aug 01 '15 at 03:14
  • @praroh1: It's a framework that will be called by a user's script. If they run their script (which runs the framework), there will be an initial slowdown while the framework explores their filesystem looking for other scripts. It may be the case that I need to re-think how and when I'm doing the exploration during the framework's initialization. Another solution may be to display some sort of warning message if the initialization process takes too long. I'll have to think on it a bit more... – Novark Aug 01 '15 at 03:22
@Novark - I wouldn't be too worried about this issue. Most people don't dump millions of files into a single folder. And those who do are used to some lag, as just about every application they use will slow down the moment it tries to access that directory. In either case, don't worry about it until you're done with your framework and its performance isn't tolerable. – rohithpr Aug 01 '15 at 03:28
  • @praroh1: Yeah, you're probably right. I'll likely ignore it for now and maybe log a warning to the framework debug log if the initialization exceeds a few seconds. Thanks for the help. – Novark Aug 01 '15 at 03:33
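
A rough sketch of the comparison FMc suggests above (the target path is a placeholder, and `find -maxdepth 1 -type d` assumes a GNU/BSD find; as FMc notes, run each test on a cold cache, since a second read of the same directory will be much faster):

import os
import subprocess
import time

target = '.'  # placeholder: point this at a directory with many files

# Time the os.walk approach; only the first yielded tuple is needed.
start = time.perf_counter()
dirs_walk = next(os.walk(target))[1]
print('os.walk: %d dirs in %.3fs' % (len(dirs_walk), time.perf_counter() - start))

# Time shelling out to find(1), restricted to a single level.
start = time.perf_counter()
out = subprocess.check_output(['find', target, '-maxdepth', '1', '-type', 'd'])
dirs_find = out.decode().splitlines()[1:]  # find lists the target itself first
print('find: %d dirs in %.3fs' % (len(dirs_find), time.perf_counter() - start))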

2 Answers

2

I'm not sure there is any direct standard function that would do this for you, but you can use os.walk(). Each iteration of os.walk() yields a tuple of the format -

(dirpath, dirnames, filenames)

Where dirpath is the directory currently being walked, dirnames contains the directories inside dirpath, and filenames contains the files inside it.

You can just call next() directly on the os.walk() generator to get the above tuple for a directory; the second element (index 1) of that tuple is the list of sub-folders inside the directory.

Code -

import os

direcs = next(os.walk('.'))[1]

direcs will then be a list of the subfolders of the current folder. You can also pass some other folder in there to get the list of folders inside it.
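
For illustration, a minimal sketch unpacking the full tuple (the example directory names are hypothetical):

import os

# next() pulls only the first tuple; the walk does not descend any further.
dirpath, dirnames, filenames = next(os.walk('.'))
print(dirpath)    # '.'
print(dirnames)   # e.g. ['folder1', 'folder2']
print(filenames)  # every regular file in '.', which is the costly part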

Anand S Kumar
  • I've tried using os.walk and several variants thereof, however, these all iterate across the files which causes a slowdown whenever you need to walk across a folder with a bunch of files. I guess I should have clarified that my question is specifically concerning the performance of retrieving the subdirectories. – Novark Aug 01 '15 at 02:39
`os.walk()` does not iterate over subdirectories immediately; it only descends into them as you iterate over `os.walk()`. In the above code, you are just getting the first-level subdirectories and files; it does not iterate inside the subdirectories. – Anand S Kumar Aug 01 '15 at 02:44
  • os.walk() doesn't work for me, because it also retrieves the files in addition to the directories, which can cause a slowdown. – Novark Aug 01 '15 at 02:56
1

There isn't a way to retrieve only directories from the operating system; you have to filter the results. However, it looks like using os.scandir improves performance by an order of magnitude (see benchmarks) over os.listdir and the older os.walk implementation, since it avoids retrieving anything but metadata where possible. If you're using Python 3.5, it's already integrated into the standard library. Otherwise, it looks like you need to use the scandir package.

To filter the results from os.scandir:

import os

ds = [e.name for e in os.scandir('.') if e.is_dir()]

According to the documentation, os.walk is now implemented in terms of os.scandir, so it gets the same speedup.
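
A rough timing sketch of that difference (assumes Python 3.5+, where os.scandir is in the standard library; as noted in the comments, OS caching makes repeated runs much faster):

import os
import time

def dirs_listdir(path):
    # Classic approach: one extra stat() per entry via os.path.isdir.
    return [n for n in os.listdir(path)
            if os.path.isdir(os.path.join(path, n))]

def dirs_scandir(path):
    # DirEntry.is_dir() can usually answer from data the directory
    # scan already returned, avoiding the per-entry stat() call.
    return [e.name for e in os.scandir(path) if e.is_dir()]

for fn in (dirs_listdir, dirs_scandir):
    start = time.perf_counter()
    result = fn('.')
    print('%s: %d dirs in %.4fs' % (fn.__name__, len(result),
                                    time.perf_counter() - start))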

Jason