
I have a set of jobs (job1, job2, etc.) that run every hour and, after they complete, generate folders (session1, session2, etc.) which contain the log files. Due to a storage limitation, I need a script that can delete the session directories older than a set time limit, but I also want to specify that it must keep a certain number of session directories, e.g. keep the latest 2 sessions, even if they are older than the set time limit.

How can I achieve this using Python's os.walk()? I want to return a list of session directories to delete: sessions_to_delete = []

/root    
    /job1             (runs every one hour)    
        /session1
            /*log
        /session2
        /session3
    /job2
        /session1
        /session2

2 Answers


In this case, it is probably easier to list all directories with glob.glob(), to match your hierarchy pattern. You can then use os.path.getctime() to get a timestamp for each directory to sort and filter by.

from glob import glob
import os.path
import time

def find_sessions_to_delete(cutoff):
    # produce a list of (timestamp, path) tuples for each session directory
    session_dirs = [(os.path.getctime(p), p) for p in glob('/root/job*/session*')]
    session_dirs.sort(reverse=True)  # sort from newest to oldest
    # remove first two elements, they are kept regardless
    session_dirs = session_dirs[2:]
    # return a list of paths whose ctime lies before the cutoff time
    return [p for t, p in session_dirs if t <= cutoff]

cutoff = time.time() - (7 * 86400)  # 7 days ago
sessions_to_delete = find_sessions_to_delete(cutoff)

I included a sample cutoff date at 7 days ago, calculated from time.time(), which returns the current time as the number of seconds passed since the 1st of January 1970 (the UNIX epoch), as a floating point number.
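
If you prefer datetime arithmetic over raw seconds (Python 3), an equivalent cutoff could be computed like this; just a sketch, the function above works with either:

from datetime import datetime, timedelta

# the same cutoff, 7 days ago, expressed as a POSIX timestamp
cutoff = (datetime.now() - timedelta(days=7)).timestamp()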

If you need to keep the two newest sessions per job directory rather than globally, do the same work for each job directory and merge the resulting lists:

def find_sessions_to_delete(cutoff):
    to_delete = []

    # process each jobdir separately
    for jobdir in glob('/root/job*'):
        # produce a list of (timestamp, path) tuples for each session directory
        session_dirs = [(os.path.getctime(p), p)
                        for p in glob(os.path.join(jobdir, 'session*'))]
        session_dirs.sort(reverse=True)  # sort from newest to oldest
        # remove first two elements, they are kept regardless
        session_dirs = session_dirs[2:]
        # Add list of paths whose ctime lies before the cutoff time
        to_delete.extend(p for t, p in session_dirs if t <= cutoff)

    return to_delete
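
To actually delete the directories returned by this function, you could pass each path to shutil.rmtree(); a minimal sketch (note that shutil.rmtree() removes a directory tree irreversibly, so inspect the list first):

import shutil

cutoff = time.time() - (7 * 86400)  # 7 days ago
for path in find_sessions_to_delete(cutoff):
    shutil.rmtree(path)  # recursively remove the session directory and its logs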

You can use os.path.getatime(path) or os.path.getmtime(path) to figure out how "old" a folder is, and then do what you need to do with it. Here is the basic info about the os.path module: https://docs.python.org/2/library/os.path.html#module-os.path

One approach to solving your problem could be this:

import os
import time

# list_of_folders, time_limit (in seconds) and delete_folder() are
# placeholders; define them to suit your setup
for folder in list_of_folders:
    if time.time() - os.path.getmtime(folder) > time_limit:
        delete_folder(folder)

If you build up list_of_folders with append() in creation order (oldest first), you can spare the latest two folders by changing the for loop like this:

for folder in list_of_folders[:-2]:
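
For that slicing to spare the newest two sessions, list_of_folders must be ordered oldest to newest. One way to build such a list, sketched here with glob (the path follows the question's layout) and sorted by modification time:

import os
import time
from glob import glob

# collect the session directories of one job, oldest first by mtime
list_of_folders = sorted(glob('/root/job1/session*'), key=os.path.getmtime)

time_limit = 7 * 86400  # example threshold: 7 days, in seconds
for folder in list_of_folders[:-2]:  # the two newest are never considered
    if time.time() - os.path.getmtime(folder) > time_limit:
        print(folder)  # candidate for deletion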