
I have a python script that caches some information to a file. The file will be reused if it exists; otherwise the script calls some other functions, which take a long time, to generate it. The file names follow certain patterns, and they are all stored in a $WORKING_DIRECTORY.

import os

WORKING_DIR = "/path/to/working/directory"  # $WORKING_DIRECTORY

def dummy(param):
    fname = os.path.join(WORKING_DIR, "file" + param)
    if os.path.exists(fname):
        ...  # reuse the cached file
    else:
        long_time_process(param)
        ...  # create the file in WORKING_DIR

Since this dummy function will be called multiple times with different params, a lot of files will be generated. I want to keep the size of the directory moderate and the information in the files relatively up-to-date, so I want to set a THRESHOLD for the size of the directory. Once the limit is reached, I will remove the oldest files until the size of the directory is reduced to half of THRESHOLD.

My current solution is:

def dummy(param):
    purge(WORKING_DIR)
    ...  # rest of dummy logic

def purge(directory):
    if get_size(directory) > THRESHOLD:
        while get_size(directory) > THRESHOLD / 2:
            # remove the oldest file in the directory
            oldest = min((os.path.join(directory, f) for f in os.listdir(directory)),
                         key=os.path.getmtime)
            os.remove(oldest)

def get_size(directory):
    size = 0
    for f in os.listdir(directory):
        size += os.path.getsize(os.path.join(directory, f))
    return size

This surely does the work, but the call to purge is unnecessary most of the time, since THRESHOLD will only be reached once in about a thousand calls. On top of that, get_size on the directory could also be time consuming if the number of files is huge.

So the question is: how do I optimize get_size and integrate the purge logic with my current dummy function? Is there a good pythonic way to do it, or a pattern I can use? Thanks

cookieisaac
  • I just tried this using os.stat(filename)[6] on large files and many files and it was still plenty fast. How many files do you expect in this directory? Why not just call get_size every 1000 iterations? – tnknepp Feb 01 '16 at 16:59
  • @tnknepp The files are relatively small, around 10 KB ~ 100 KB, and the threshold for the directory should be around 100 MB, so that will be 1000 ~ 10000 files. The goal is to keep the directory size moderate, and a call counter is not a good indicator for get_size. Also, this is a python _script_ that's invoked as `python dummy.py` by its caller, so an in-memory global variable would not work (a sketch of a persisted counter follows after these comments). – cookieisaac Feb 01 '16 at 19:36
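
A minimal sketch of the "only check every N calls" idea from the comments above. Since the script is restarted on every invocation, the call counter is persisted to a small file; the counter file name and the CHECK_EVERY value are illustrative assumptions, and WORKING_DIR and purge() are as in the question.

import os

COUNTER_FILE = "dummy_call_count.txt"  # hypothetical counter file, kept outside WORKING_DIR
CHECK_EVERY = 1000                     # only measure/purge the directory every N calls

def should_check_size():
    # bump a counter persisted on disk; return True once every CHECK_EVERY calls
    count = 0
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as fh:
            count = int(fh.read() or 0)
    count += 1
    with open(COUNTER_FILE, "w") as fh:
        fh.write(str(count))
    return count % CHECK_EVERY == 0

def dummy(param):
    if should_check_size():
        purge(WORKING_DIR)  # purge() as defined in the question
    ...  # rest of dummy logic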

2 Answers


You could use a global variable to keep count of how many times the function has been called:

import os

count = 0  # module-level call counter

def dummy(param):
    global count
    count += 1
    if count > THRESHOLD:
        purge(WORKING_DIR)

    ...  # rest of dummy logic

def purge(directory):
    global count
    while count > THRESHOLD / 2:
        # remove the oldest file in the directory
        oldest = min((os.path.join(directory, f) for f in os.listdir(directory)),
                     key=os.path.getmtime)
        os.remove(oldest)
        count -= 1

def get_size(directory):
    size = 0
    for f in os.listdir(directory):
        size += os.path.getsize(os.path.join(directory, f))
    return size
danidee
  • This does not remove the multitude of get_size() calls, which the OP claimed to be part of the bottleneck. Wouldn't calling get_size every 1000 iterations work better? – tnknepp Feb 01 '16 at 17:02
  • @danidee Thanks for the answer. I ended up using the _number_ of files as a purging metric and set an explicit target for the purge level. I ordered the files by last-accessed time and removed the files in range [TARGET:] to avoid the frequent get_size operation. – cookieisaac Mar 10 '16 at 18:52

I ended up using the number of files as the purging metric and set an explicit target for the purge level. I ordered the files by last-accessed time and removed the files in range [TARGET:] to avoid the frequent get_size operation.

The skeleton of my purge logic is as follows. I can purge this way because my files are typically small, so the number of files is a good indicator of the total size.

import glob
import os

def purge(directory, filepattern):
    files = glob.glob(os.path.join(directory, filepattern))
    if len(files) > THRESHOLD:
        # most recently accessed files first
        files.sort(key=os.path.getatime, reverse=True)
        # keep the first TARGET files, remove the rest
        for f in files[TARGET:]:
            os.remove(f)
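
Wiring this purge into the original dummy might look roughly like the following sketch; the THRESHOLD and TARGET values and the "file*" glob pattern are illustrative assumptions, not part of the answer.

THRESHOLD = 10000  # assumed: purge once the cache holds more than ~10000 files
TARGET = 5000      # assumed: number of most recently accessed files to keep

def dummy(param):
    purge(WORKING_DIR, "file*")  # hypothetical pattern matching the cached files
    ...  # rest of dummy logic (reuse the cached file, or regenerate it if missing)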
cookieisaac