
I'm using glob to feed file names to a loop like so:

inputcsvfiles = glob.iglob('NCCCSM*.csv')

for x in inputcsvfiles:
    csvfilename = x
    # do stuff here

The toy example that I used to prototype this script works fine with 2, 10, or even 100 input csv files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.

Given that the script works absolutely fine with a "reasonable" number of entries (2-100) but not with what I need (10,959), is there a better way to handle this situation, or some sort of parameter I can set to allow for a high number of iterations?

PS: initially I was using glob.glob, but glob.iglob fares no better.

Edit:

An expansion of above for more context...

    # typical input file looks like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
    inputcsvfiles = glob.iglob('NCCCSM*.csv')

    # loop over individual input files
    for x in inputcsvfiles:
        csvfile = x
        modelname = x[0:5]

        # ArcPy
        arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")

        # do more stuff after

The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this will work fine if the directory being glob'd only has 100 or so files, but if there's a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.

Prophet60091
  • does `print(sum(1 for _ in glob.iglob('NCCCSM*.csv')))` print the correct number of files? – jfs Jul 26 '12 at 18:09
  • Works for me. (Python 2.7 on OS X). Are you sure you didn't change `do stuff here` in between testing with 2 files and 10959? – Wooble Jul 26 '12 at 18:11
  • @J.F.Sebastian - yep, the command returns 10,958 – Prophet60091 Jul 26 '12 at 18:34
  • @Wooble - positive, just re-ran the same code and works fine with 100 csv files (Python 2.6.5, Windows7 64-bit) – Prophet60091 Jul 26 '12 at 18:41
  • @Prophet60091: it means glob works as expected. Note: iglob returns an iterator; you can only iterate once over all files, otherwise use glob.glob() (see the sketch below the comments). – jfs Jul 26 '12 at 18:47
  • @J.F.Sebastian - I guess this means that there's a problem with "do stuff here". Infuriating! Thanks. – Prophet60091 Jul 26 '12 at 19:07
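To illustrate the point from the comments above: glob.glob returns a list that can be looped over any number of times, while glob.iglob returns a one-shot iterator that yields nothing after the first full pass. A minimal sketch (the counts are only illustrative):

    import glob

    files_iter = glob.iglob('NCCCSM*.csv')    # one-shot iterator
    print(sum(1 for _ in files_iter))         # e.g. 10958 on the first pass
    print(sum(1 for _ in files_iter))         # 0 -- the iterator is already exhausted

    files_list = glob.glob('NCCCSM*.csv')     # plain list
    print(len(files_list))                    # same count...
    print(len(files_list))                    # ...every time it is checked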

3 Answers


Try doing an `ls *` in a shell on those 10,000 entries and the shell would fail too. How about walking the directory and yielding those files one by one for your purpose?

#credit - @dabeaz - generators tutorial

import os
import fnmatch

def gen_find(filepat,top):
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist,filepat):
            yield os.path.join(path,name)

# Example use

if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv",".")
    for name in lognames:
        print name
Senthil Kumaran
  • This could potentially yield a lot more files than the original post requested. – mgilson Jul 26 '12 at 18:14
  • glob will work even if `ls *` fails. glob, os.walk call the same os.listdir() that returns file names as a list. 10000 is a small number. – jfs Jul 26 '12 at 18:15
  • @mgilson an extra check could be added. – Senthil Kumaran Jul 26 '12 at 18:17
  • @J.F.Sebastian --(good comment) I think that it is important to point out that `glob('*')` (as far as implementation is concerned) is a lot closer to doing `ls` than it is to doing `ls *`. – mgilson Jul 26 '12 at 18:19
  • You can easily show that `ls *` works fine for 10,000 files. `touch NCCCSM{0..9999}.csv && ls *`. No failure here and it's quite fast. – kojiro Jul 26 '12 at 18:20
  • @kojiro: in your example `ls`, `touch` work for the same reason: because `execve()` with all 1000 files at once succeeds, i.e., both `*` and `{..}` are handled by bash. The limit on how much you can put in a single command is probably smaller (as small as 128K) than the max number of entries in a directory. You can read directories with millions files (to avoid large memory consumption use [`readdir()`](http://stackoverflow.com/a/5091076/4279) directly, or [`getdents()`](http://www.olark.com/spw/2011/08/you-can-list-a-directory-with-8-million-files-but-not-with-ls/) on Linux for speed. – jfs Jul 26 '12 at 20:15
  • @J.F.Sebastian *The limit on how much you can put in a single command…* but `touch` is a command, as is `ls`. So the only limitation would be if you were asking the command itself to expand the glob, but that isn't even relevant to this question, is it? – kojiro Jul 26 '12 at 20:51
  • @kojiro: It should be read as "put in a single command-**line**" i.e., it refers to `execve()` (see the sketch below). – jfs Jul 26 '12 at 21:54
  • @J.F.Sebastian I think you missed the point I was trying to make. The fact that bash does brace and glob expansion before the command is ever executed is exactly my point: That `MAX_ARGS` is (on most modern systems) much larger than 10k and that this answer is wrong unless OP is referring to a rare or embedded system. – kojiro Jul 26 '12 at 22:41
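For reference, the command-line length limit the last few comments refer to is the size of the argument list handed to execve(), not a cap on how many entries a directory can hold or on what glob/os.listdir can iterate over. On a Unix-like system it can be read with a minimal sketch like this (the value varies by system):

    import os

    # maximum combined size, in bytes, of the argument list and environment that
    # can be passed to execve(); this is the limit an expanded `ls *` can hit
    print(os.sysconf('SC_ARG_MAX'))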

If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.

There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
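For what it's worth, on a Unix-like system the same limit can also be read from inside Python with the standard resource module (a minimal sketch; resource is not available on Windows, which the question mentions using):

    import resource  # POSIX-only standard-library module

    # soft/hard caps on the number of file descriptors this process may have open;
    # leaking file handles in a long loop eventually hits the soft limit
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit: %d, hard limit: %d" % (soft, hard))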

Patrick Maupin
jfs

One issue that arose was not with Python per se, but rather with ArcPy and/or MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file whereby information on each CSV file processed in the loop gets added and stored. Over time, the schema.ini gets rather large and I believe that's when the performance issues arise.

My solution, although perhaps inelegant, was to delete the schema.ini file during each loop iteration to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
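A minimal sketch of that workaround (the helper name is mine, and it assumes schema.ini is written into the same directory as the CSV files being joined):

    import os

    def remove_schema_ini(csv_dir):
        # the join step writes/updates a schema.ini next to the CSVs (see above);
        # deleting it keeps the file from growing with every CSV processed
        schema = os.path.join(csv_dir, 'schema.ini')
        if os.path.exists(schema):
            os.remove(schema)

Calling something like this at the top of each loop iteration, just before the arcpy.AddJoin_management line, is all the workaround amounts to.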

Prophet60091