
I have a set of jobs (job1, job2, etc.) that run every hour and, after they complete, generate folders (session1, session2, etc.) which contain the log files. Due to a storage limitation, I need a script that can delete the session directories older than a set time limit, but I also want to specify that it must keep a certain number of session directories, e.g. keep the latest 2 sessions, even if they are older than the set time limit.

How can I achieve this using Python's os.walk()? I want to return a list of session directories to delete: sessions_to_delete = []

/root    
    /job1             (runs every one hour)    
        /session1
            /*log
        /session2
        /session3
    /job2
        /session1
        /session2

2 Answers


In this case, it is probably easier to list all directories with glob.glob(), to match your hierarchy pattern. You can then use os.path.getctime() to get a timestamp for each directory to sort and filter by.

from glob import glob
import os.path
import time

def find_sessions_to_delete(cutoff):
    # produce a list of (timestamp, path) tuples for each session directory
    session_dirs = [(os.path.getctime(p), p) for p in glob('/root/job*/session*')]
    session_dirs.sort(reverse=True)  # sort from newest to oldest
    # remove first two elements, they are kept regardless
    session_dirs = session_dirs[2:]
    # return a list of paths whose ctime lies before the cutoff time
    return [p for t, p in session_dirs if t <= cutoff]

cutoff = time.time() - (7 * 86400)  # 7 days ago
sessions_to_delete = find_sessions_to_delete(cutoff)

I included a sample cutoff date at 7 days ago, calculated from time.time(), which returns the current time as the number of seconds passed since the 1st of January 1970 (the UNIX epoch), as a floating point number.
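
If you prefer datetime arithmetic over raw seconds (Python 3), an equivalent cutoff could be computed like this; just a sketch, the function above works with either:

from datetime import datetime, timedelta

# the same cutoff, 7 days ago, expressed as a POSIX timestamp
cutoff = (datetime.now() - timedelta(days=7)).timestamp()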

If you need to keep the two newest sessions per job directory rather than globally, do the same work for each job directory and merge the resulting lists:

def find_sessions_to_delete(cutoff):
    to_delete = []

    # process each jobdir separately
    for jobdir in glob('/root/job*'):
        # produce a list of (timestamp, path) tuples for each session directory
        session_dirs = [(os.path.getctime(p), p)
                        for p in glob(os.path.join(jobdir, 'session*'))]
        session_dirs.sort(reverse=True)  # sort from newest to oldest
        # remove first two elements, they are kept regardless
        session_dirs = session_dirs[2:]
        # Add list of paths whose ctime lies before the cutoff time
        to_delete.extend(p for t, p in session_dirs if t <= cutoff)

    return to_delete
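
To actually delete the directories returned by this function, you could pass each path to shutil.rmtree(); a minimal sketch (note that shutil.rmtree() removes a directory tree irreversibly, so inspect the list first):

import shutil

cutoff = time.time() - (7 * 86400)  # 7 days ago
for path in find_sessions_to_delete(cutoff):
    shutil.rmtree(path)  # recursively remove the session directory and its logs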

You can use os.path.getatime(path) or os.path.getmtime(path) to figure out how "old" a folder is, and then do what you need to do with it. Here is the basic info about the os.path module: https://docs.python.org/2/library/os.path.html#module-os.path

One approach to solving your problem could be this:

import os
import time

# list_of_folders, time_limit (in seconds) and delete_folder() are
# placeholders; define them to suit your setup
for folder in list_of_folders:
    if time.time() - os.path.getmtime(folder) > time_limit:
        delete_folder(folder)

If you build up list_of_folders with append() in creation order (oldest first), you can spare the latest two folders by changing the for loop like this:

for folder in list_of_folders[:-2]:
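
For that slicing to spare the newest two sessions, list_of_folders must be ordered oldest to newest. One way to build such a list, sketched here with glob (the path follows the question's layout) and sorted by modification time:

import os
import time
from glob import glob

# collect the session directories of one job, oldest first by mtime
list_of_folders = sorted(glob('/root/job1/session*'), key=os.path.getmtime)

time_limit = 7 * 86400  # example threshold: 7 days, in seconds
for folder in list_of_folders[:-2]:  # the two newest are never considered
    if time.time() - os.path.getmtime(folder) > time_limit:
        print(folder)  # candidate for deletion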