
While checking the efficiency of os.walk, I created 600,000 files, each containing the string Hello <number> (where <number> is just the index of the file in the directory). So the contents of the directory look like this:

File Name  | Contents
1.txt      | Hello 1
2.txt      | Hello 2
...
600000.txt | Hello 600000
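
(For reference, a minimal sketch of the kind of loop used to create these files; the directory name is the same one used in the code below:)

import os

# Create 600,000 small text files, each containing "Hello <number>".
os.mkdir('too_many_same_type_files')
for i in xrange(1, 600001):
    path = os.path.join('too_many_same_type_files', '%d.txt' % i)
    with open(path, 'w') as f:
        f.write('Hello %d' % i)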

Now, I ran the following code:

import os

a = os.walk(os.path.join(os.getcwd(), 'too_many_same_type_files'))  # the path where those 600,000 txt files live
print a.next()

The problem I ran into is that a.next() takes too much time and memory, because the 3rd item that a.next() returns is the list of files in the directory (which has 600,000 items). So I am trying to figure out a way to reduce the space complexity (at least), by somehow making a.next() return a generator as the 3rd item of the tuple instead of a list of file names.

Would that be a good idea to reduce the space complexity?

GodMan
  • It's a bad idea to have 600,000 files in the same directory, because it will seriously impair filesystem performance. And even if you do so, storing 600,000 file names in memory typically uses around 20 MB of space. These 20 MB certainly aren't the worst effect of having that many files in a single directory. My recommendation is to fix the *actual* problem. – Sven Marnach Aug 16 '12 at 16:47
  • I'm pretty sure this depends on the filesystem used – Useless Aug 16 '12 at 16:49
  • It's not really possible to return a generator of filenames because of how the *operating system* returns the files... – Wayne Werner Aug 16 '12 at 16:55
  • @WayneWerner: At least on Linux, [`readdir()`](http://linux.die.net/man/3/readdir) gives you the directory entries one by one, and I remember this is also how it worked on DOS. Don't know about Windows, though… – Sven Marnach Aug 16 '12 at 17:12
  • @Useless: Pretty much every widely-used filesystem gets slow with too many files in a single directory. If I remember correctly, XFS is an exception to this rule, but the chances the OP is using XFS are negligible. – Sven Marnach Aug 16 '12 at 17:14

3 Answers


It's such a good idea that it's the way the underlying C API works!

If you can get access to readdir, you can do it: unfortunately this isn't directly exposed by Python.
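
For example, on Linux you can reach readdir() through ctypes instead of writing C. The sketch below is only illustrative: the struct dirent layout is an assumption (glibc on x86-64), and the iter_dir() helper is a hypothetical name with only minimal error handling:

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

class Dirent(ctypes.Structure):
    # struct dirent layout for glibc on x86-64 Linux -- an assumption;
    # other platforms lay this structure out differently.
    _fields_ = [('d_ino', ctypes.c_ulong),
                ('d_off', ctypes.c_long),
                ('d_reclen', ctypes.c_ushort),
                ('d_type', ctypes.c_ubyte),
                ('d_name', ctypes.c_char * 256)]

libc.opendir.restype = ctypes.c_void_p
libc.readdir.restype = ctypes.POINTER(Dirent)
libc.readdir.argtypes = [ctypes.c_void_p]
libc.closedir.argtypes = [ctypes.c_void_p]

def iter_dir(path):
    # Yield directory entries one at a time, never building a full list.
    handle = libc.opendir(path)
    if not handle:
        raise OSError('opendir() failed for %r' % path)
    try:
        while True:
            entry = libc.readdir(handle)
            if not entry:          # NULL pointer: no more entries
                break
            name = entry.contents.d_name
            if name not in ('.', '..'):
                yield name
    finally:
        libc.closedir(handle)

for name in iter_dir('too_many_same_type_files'):
    print name    # each file name arrives lazily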

This question shows two approaches (both with drawbacks).

A cleaner approach would be to write a module in C to expose the functionality you want.

Useless
  • I didn't have a look at the other question, but in Python you can use `glob.iglob("*")` and `glob.iglob(".*")` to get an iterator over all directory entries. I don't see any severe drawback to this approach. – Sven Marnach Aug 16 '12 at 17:10

As folks have already mentioned, 600,000 files in a directory is a bad idea. Initially I thought there was really no way to do this because of how you get access to the file list, but it turns out I was wrong. You could use the following steps to achieve what you want (a rough sketch follows the list):

  1. Use subprocess or os.system to call ls or dir (whichever your OS provides). Direct the output of that command to a temporary file (say /tmp/myfiles or similar; Python's tempfile module can create one for you).

  2. Open that file for reading in Python.

  3. File objects are iterable and will return each line, so as long as you have just the filenames, you'll be fine.
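
Here's a minimal sketch of those three steps, assuming a Unix-like system with ls available; the directory name is the one from the question:

import subprocess
import tempfile

# Step 1: dump the directory listing into a temporary file.
with tempfile.NamedTemporaryFile(mode='w+') as tmp:
    subprocess.check_call(['ls', 'too_many_same_type_files'], stdout=tmp)
    # Steps 2 and 3: rewind and iterate lazily, one file name per line.
    tmp.seek(0)
    for line in tmp:
        filename = line.rstrip('\n')
        print filename    # or process each name as it arrives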

Wayne Werner

os.walk calls listdir() under the hood to retrieve the contents of the root directory, then splits the returned list of entries into dirs and non-dirs.

To achieve what you want you'll need to dig much lower down and implement not only your own version of walk() but also an alternative listdir() that returns a generator. Note that even then you will not be able to provide independent generators for both dirs and files unless you make two separate calls to the modified listdir() and filter the results on the fly.
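
To make the shape of that rewrite concrete, here is a rough outline. It assumes a hypothetical lazy_listdir() helper that yields names one at a time (e.g. a readdir()-based replacement for os.listdir), and it only handles a single directory level:

import os

def lazy_walk(top):
    # Simplified walk(): yields (dirpath, dir_gen, file_gen) where the
    # last two items are generators instead of lists. lazy_listdir() is
    # a hypothetical lazy replacement for os.listdir().
    dirs = (name for name in lazy_listdir(top)
            if os.path.isdir(os.path.join(top, name)))
    files = (name for name in lazy_listdir(top)
             if not os.path.isdir(os.path.join(top, name)))
    yield top, dirs, files
    # Recursing into subdirectories would require materialising the
    # directory names (or yet another pass), which is one of the
    # drawbacks mentioned above.

Note the two separate lazy_listdir() calls: that is the "filter the results on the fly" trade-off described above.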

As suggested by Sven in the comments above, it might be better to address the actual problem (too many files in a dir) rather than over-engineer a solution.

Shawn Chin