16

I have a script that removes DB dumps that are older than say X=21 days from a backup dir:

DB_DUMP_DIR=/var/backups/dbs
RETENTION=$((21*24*60))  # 3 weeks

find ${DB_DUMP_DIR} -type f -mmin +${RETENTION} -delete

But if for whatever reason the DB dump job fails to complete for a while, all dumps will eventually be thrown away. So as a safeguard I want to keep at least the youngest Y=7 dumps, even if all or some of them are older than 21 days.

I'm looking for something more elegant than this spaghetti:

DB_DUMP_DIR=/var/backups/dbs
RETENTION=$((21*24*60))  # 3 weeks
KEEP=7

find ${DB_DUMP_DIR} -type f -printf '%T@ %p\n' |  # list all dumps with epoch mtime
  sort -n |                                       # sort by epoch, oldest 1st
  head --lines=-${KEEP} |                         # drop the youngest/bottom $KEEP dumps
  while read -r date filename ; do                # loop through the rest
    find "$filename" -mmin +${RETENTION} -delete  # delete if older than 21 days
  done

(This snippet might have minor bugs - ignore them. It's to illustrate what I can come up with myself, and why I don't like it.)

Edit: The find option "-mtime" is off by one: "-mtime +21" actually means "at least 22 days old". That always confused me, so I use -mmin instead. Still off by one, but only by a minute.
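
To make the off-by-one concrete, here is a small comparison (only a sketch, using the directory from above; both commands just print what they would match):

# Suppose a dump is 21 days and 12 hours old.
# -mtime +21 does NOT match it: find truncates the age to whole days (21),
# and "+21" means "strictly more than 21", i.e. at least 22 full days.
find /var/backups/dbs -type f -mtime +21

# -mmin +30240 (21*24*60) does match it, since its age is 30960 minutes.
find /var/backups/dbs -type f -mmin +$((21*24*60))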

Nils Toedtmann
  • I vote to close this question as a duplicate of a newer question, as none of the answers below seem to properly answer it. The presented duplicate has a perfectly valid answer. – kvantour Feb 15 '21 at 10:18
  • mtime is easier to use when working in days: `-mmin n` means the file's data was last modified n minutes ago; `-mtime n` means the file's data was last modified n*24 hours ago. – gaoithe Nov 09 '21 at 20:43

8 Answers

4

Use find to get all files that are old enough to delete, filter out the $KEEP youngest with tail, then pass the rest to xargs.

find ${DB_DUMP_DIR} -type f -mmin +$RETENTION -printf '%T@ %p\n' |
  sort -nr | tail -n +$KEEP |
  xargs -r echo

Replace echo with rm if the reported list of files is the list you want to remove.

(I assume none of the dump files have newlines in their names.)
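
If you do want to feed the result straight to rm, the leading timestamp field has to be stripped first. A possible variant (only a sketch, assuming GNU coreutils, and folding in the off-by-one fix mentioned in the comments below) could look like:

find ${DB_DUMP_DIR} -type f -mmin +$RETENTION -printf '%T@ %p\n' |
  sort -nr | tail -n +$((KEEP+1)) |
  cut -d ' ' -f 2- |      # strip the "%T@ " timestamp field
  xargs -r echo rm        # drop "echo" once the list looks right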

chepner
  • This (like David's answer) would always leave 7 files older than $RETENTION even if not required. Note that "tail -n +$KEEP" is off by one; it should be "tail -n +$((KEEP+1))". I like "xargs", I'll play with that. Though one still has to strip the epochs. – Nils Toedtmann Dec 03 '13 at 21:54
  • this doesn't work if KEEP – Orabîg Mar 08 '19 at 07:39
  • Same as the comments above, but maybe more clearly stated: you cannot filter on modification time before you sort and tail; then you will only keep $KEEP of those older than $RETENTION, which is not what is desired, neither here nor in general. – Marcus Philip Dec 01 '20 at 16:12
2

I'm opening a second answer because I have a different solution, one using awk: for each file, add the 21-day period (in seconds) to its timestamp, subtract the current time, and remove the ones that come out negative (after sorting and removing the newest 7 from the list):

DB_DUMP_DIR=/var/backups/dbs
RETENTION=21*24*60*60  # 3 weeks
CURR_TIME=`date +%s`

find ${DB_DUMP_DIR} -type f -printf '%T@ %p\n' | \
  awk '{ print int($1) -'${CURR_TIME}' + '${RETENTION}' ":" $2}' | \
  sort -n | head -n -7 | grep '^-' | cut -d ':' -f 2- | xargs rm -rf
rabensky
  • Not quoting `${RETENTION}` means it could be expanded by the shell. The risk is small but the fix is easy. (Ideally these variables should also be converted to lower case.) – tripleee Jul 18 '19 at 05:17
2

None of these answers quite worked for me, so I adapted chepner's answer and came to this, which simply retains the last $KEEP backups.

find ${DB_DUMP_DIR} -type f -printf '%T@ %p\n' | # print entries with modification time
  sort -n |                              # sort in date-ascending order
  head -n -$KEEP |                       # remove the $KEEP most recent entries
  awk '{ print $2 }' |                   # select the file paths
  xargs -r rm                            # remove the file paths

I believe chepner's code retains the $KEEP oldest, rather than the youngest.

ireardon
1

You can use -mtime instead of -mmin which means you don't have to calculate the number of minutes in a day:

find $DB_DUMP_DIR -type f -mtime +21

Instead of deleting them, you could use the stat command to sort the files by age:

find $DB_DUMP_DIR -type f -mtime +21 | while read file
do
    stat -f "%-10m %40N" $file
done | sort -r | awk 'NR > 7 {print $2}'

This will list all files older than 21 days, but not the seven youngest that are older than 21 days.

From there, you could feed this into xargs to do the remove:

find $DB_DUMP_DIR -type f -mtime +21 | while read file
do
    stat -f "%-10m %40N" $file
done | sort -r | awk 'NR > 7 {print $2}' | xargs rm

Of course, this is all assuming that you don't have spaces in your file names. If you do, you'll have to take a slightly different tack.
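
For what it's worth, one whitespace-tolerant tack is to keep the records NUL-delimited end to end. This is only a sketch and assumes reasonably recent GNU find and coreutils (so not the OS X tools used elsewhere in this answer):

find "$DB_DUMP_DIR" -type f -mtime +21 -printf '%T@\t%p\0' |
  sort -z -n |          # oldest first
  head -z -n -7 |       # drop the 7 youngest of the matches
  cut -z -f 2- |        # strip the (tab-delimited) timestamp field
  xargs -0 -r echo rm   # drop "echo" once the output looks right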

This will also keep the seven youngest files over 21 days old. You might have files younger than that, in which case you don't really need to keep those older ones. However, you could simply run the same sequence again (except remove the -mtime parameter):

find $DB_DUMP_DIR -type f | while read file
do
    stat -f "%-10m %40N" $file
done | sort -r | awk 'NR > 7 {print $2}' | xargs rm

You need to look at your stat command to see what the options are for the format. This varies from system to system. The one I used is for OS X. Linux is different.
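
For reference, the rough Linux equivalent of the stat call above would be something like this (GNU stat uses -c and different format letters; the field widths are not reproduced here):

stat -c '%Y %n' "$file"    # mtime in seconds since the epoch, then the file name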


Let's take a slightly different approach. I haven't thoroughly tested this, but:

If all of the files are in the same directory, and none of the file names have whitespace in them:

ls -t | awk 'NR > 7 {print $0}'

Will print out all of the files except for the seven youngest files. Maybe we can go with that?

current_seconds=$(date +%s)   # Seconds since the epoch
((days = 60 * 60 * 24 * 21))  # Number of seconds in 21 days
((oldest_allowed = $current_seconds - $days)) # Oldest allowed mtime
ls -t | awk 'NR > 7 {print $0}' | xargs stat -f "%Dm %N" | while read date file
do
    [ "$date" -ge "$oldest_allowed" ] || rm "$file"
done

The ls ... | awk will shave off the seven youngest. After that, we use stat to get the name of the file and the date. Since the date is in seconds since the epoch, we had to calculate what 21 days prior to the current time would be, also in seconds since the epoch.

After that, it's pretty simple. We look at the date of the file. If it's more than 21 days old (i.e., its timestamp is lower than the cutoff), we can delete it.

As I said, I haven't thoroughly tested this, but this will delete all files over 21 days old, and only files over 21 days old, while always keeping the seven youngest.

David W.
  • I don't use -mtime because it is off by one: "-mtime +21" actually means "at least 22 days old". That always confuses me, so I use -mmin instead. Probably still off by one, but I am OK with being off by a minute. – Nils Toedtmann Dec 03 '13 at 21:26
  • As you say, this would always leave me "the seven youngest files over 21 days old" even when not needed. And the last command would only leave me the 7 youngest overall. Interesting, but not answering my question. – Nils Toedtmann Dec 03 '13 at 22:01
  • using `find ... -printf "%T@ %p"` allows you to remove the while-stat loop – glenn jackman Dec 04 '13 at 13:12
  • I was wondering why you used `-mmin`. Thanks for the explanation. I wanted to be able to delete all files over 21 days old, but keep at least 7. Maybe a better way (if they're all in a single directory) would be `ls -t | stat .... | awk` and in the awk program, if the date >= 21 days, delete it. Maybe I'll modify my answer to use that. This would eliminate the seven youngest, but then delete any over 21 days while keeping the rest. – David W. Dec 04 '13 at 16:44
  • Okay, I've added a second approach. I use a Mac, so I don't have all GNU utilities. For example, my `date` and `stat` commands are a wee bit different. – David W. Dec 04 '13 at 17:08
  • @DavidW. You get the gnu utils from homebrew. Highly recommend. – DylanYoung Sep 07 '18 at 23:41
1

What I ended up using is:

  • always keep last N items
  • then for the rest, if the file is older than X days, delete it

for f in $(ls -1t | tail -n +31); do
   if [[ $(find "$f" -mtime +30 -print) ]]; then
      echo "REMOVING old backup: $f"
      rm "$f"
   fi
done

explanation:

ls, sort by time, skip first 30 items: $(ls -1t | tail -n +31)

test if find can find the file being older than 30 days: if [[ $(find "$f" -mtime +30 -print) ]]

ds77
0

You could do the loop yourself:

t21=$(date -d "21 days ago" +%s)
cd "$DB_DUMP_DIR"
for f in *; do
    if (( $(stat -c %Y "$f") <= $t21 )); then
        echo rm "$f"
    fi
done

I'm assuming you have GNU date.
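
If you are on BSD/macOS date, which has no -d option, a rough equivalent would be something like:

t21=$(date -v-21d +%s)   # 21 days ago, via BSD date's -v adjustment flag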

glenn jackman
  • Thanks for 'date -d "21 days ago" +%s', I didn't know that. So in my sample script, I could alter the while loop block to: [ "${date%\.[0-9]*}" -lt "${t21}" ] && echo rm ${filename} – Nils Toedtmann Dec 03 '13 at 22:25
  • Yes. However using bash's `[[ ]]` means you need less quoting: `[[ ${date%.*} -lt $t21 ]]`. Also dot is not a special glob character so you don't have to escape it: `${date%\.[0-9]*}` means "remove a dot followed by a digit followed by zero or more of any character". If you want to remove strictly digits, you'll need `shopt -s extglob` then `${date%.*([0-9])}` -- see http://www.gnu.org/software/bash/manual/bashref.html#Pattern-Matching – glenn jackman Dec 04 '13 at 13:18
  • But this does nothing to keep the required number of files regardless of how old they are. – tripleee Jul 18 '19 at 05:20
0

Here is a BASH function that should do the trick. I couldn't avoid two invocations of find easily, but other than that, it was a relative success:

#  A "safe" function for removing backups older than REMOVE_AGE + 1 day(s), always keeping at least the ALWAYS_KEEP youngest
remove_old_backups() {
    local file_prefix="${backup_file_prefix:-$1}"
    local temp=$(( REMOVE_AGE+1 ))  # for inverting the mtime argument: it's quirky ;)
    # We consider backups made on the same day to be one (commonly these are temporary backups in manual intervention scenarios)
    local keeping_n=`/usr/bin/find . -maxdepth 1 \( -name "$file_prefix*.tgz" -or -name "$file_prefix*.gz" \) -type f -mtime -"$temp" -printf '%Td-%Tm-%TY\n' | sort -d | uniq | wc -l`
    local extra_keep=$(( $ALWAYS_KEEP-$keeping_n ))

    /usr/bin/find . -maxdepth 1 \( -name "$file_prefix*.tgz" -or -name "$file_prefix*.gz" \) -type f -mtime +$REMOVE_AGE -printf '%T@ %p\n' |  sort -n | head -n -$extra_keep | cut -d ' ' -f2 | xargs -r rm
}

It takes a backup_file_prefix env variable or it can be passed as the first argument, and expects environment variables ALWAYS_KEEP (minimum number of files to keep) and REMOVE_AGE (number of days to pass to -mtime). It expects a gz or tgz extension. There are a few other assumptions, as you can see in the comments, mostly in the name of safety.
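
A hypothetical invocation might look like this (directory, prefix and numbers are just examples; the function works on the current directory, hence the cd):

ALWAYS_KEEP=7       # always keep at least the 7 youngest dumps
REMOVE_AGE=21       # remove matching dumps older than 21(+1) days
cd /var/backups/dbs || exit 1
remove_old_backups "mydump"   # matches mydump*.tgz and mydump*.gz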

Thanks to ireardon and his answer (which doesn't quite answer the question) for the inspiration!

Happy safe backup management :)

DylanYoung
  • As you can see, I prefer the quirks of `mtime` to manually computing the minutes. Using `mmin`, you should be able to remove the quirky `temp` variable at the cost of minor indeterminacy in results when backups are created close to the time the function is called: nothing disastrous. – DylanYoung Sep 07 '18 at 23:47
  • How about lisibility? :) – Orabîg Mar 08 '19 at 07:38
  • Huh? You mean legibility or usability? Well, you can rename the "temp" variable to something more meaningful (`inverted_mtime`) and add some more configuration (per file type, for example). Otherwise, if you know `bash` and are familiar with `find`, `sort`, `head`, `uniq`, `wc`, `cut`, and `xargs` (pretty standard unix tools), this should be perfectly legible to you. If you aren't, they're just a few man pages or google searches away. – DylanYoung Mar 08 '19 at 20:12
  • Hardcoding the path to `find` is just wacky. Simply make sure your `PATH` is correct. – tripleee Jul 18 '19 at 05:21
  • @tripleee Users sometimes modify their PATHs. This ensures that only the system find is ever used. I'd prefer if there was a more standard way to access the system version of a program, but I don't know of one. If you're confident in your users setting their paths right, by all means remove the prefix :) – DylanYoung Jul 18 '19 at 16:54
  • @tripleee I suppose I could explicitly set the PATH at the top (to root's PATH maybe?). That would probably be cleaner. – DylanYoung Jul 18 '19 at 16:57
  • Yes, exactly. But usually you would trust the user to have a sane path for system utilities, or a good reason to want to override the system version, which you will break by overriding their preference. – tripleee Jul 18 '19 at 18:04
  • Like I said, if you're distributing it, by all means go ahead and make your alterations! I write my programs to run without errors where they're intended to run. Some programs are supposed to run everywhere. Some are run on specific OSes with known configurations, and tailoring to that environment really reduces errors caused by people trying to use it in the WRONG context. If you'd like to propose a better standard for canonically accessing system utilities that should be on every system, I'm all ears for that. – DylanYoung Jul 19 '19 at 18:10
0

Experimenting with the solutions given in the other answers, I found many bugs or situations that were not wanted.

Here is the solution I finally came up with :

  # Sample variable values
  BACKUP_PATH='/data/backup'
  DUMP_PATTERN='dump_*.tar.gz'
  NB_RETENTION_DAYS=10
  NB_KEEP=2                    # keep at least the 2 most recent files in all cases

  find ${BACKUP_PATH} -name "${DUMP_PATTERN}" \
    -mtime +${NB_RETENTION_DAYS} > /tmp/obsolete_files

  find ${BACKUP_PATH} -name "${DUMP_PATTERN}" \
    -printf '%T@ %p\n' | \
    sort -n            | \
    tail -n ${NB_KEEP} | \
    awk '{ print $2 }'   > /tmp/files_to_keep

  grep -F -f /tmp/files_to_keep -v /tmp/obsolete_files > /tmp/files_to_delete

  cat /tmp/files_to_delete | xargs -r rm

The ideas are:

  • Most of the time, I just want to keep files that are not older than NB_RETENTION_DAYS.
  • However, shit happens, and when for some reason there are no recent files anymore (backup scripts are broken), I don't want to remove the NB_KEEP most recent ones, for safety (NB_KEEP should be at least 1).

In my case, I have 2 backups a day and set NB_RETENTION_DAYS to 10 (thus, I normally have 20 files in a normal situation). One could think that I would thus set NB_KEEP=20, but in fact I chose NB_KEEP=2, and here is why:

Let's imagine my backup scripts are broken and I haven't had a backup for a month. I really don't care about keeping my 20 latest files when they are all more than 30 days old; having at least one is what I want. However, being able to easily identify that there is a problem is very important (obviously my monitoring system is really blind, but that's another point), and a backup folder containing 10 times fewer files than usual is maybe something that could ring a bell...

Orabîg
  • Looks like my solution, except you created three temporary files and did some extra grepping :p – DylanYoung Mar 25 '19 at 15:49
  • You really want to avoid temporary files. If they cannot be avoided, you really **MUST** avoid using static temporary file names. The solution is called `mktemp`. – tripleee Jul 18 '19 at 05:18
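
A sketch of that mktemp suggestion applied to the three temporary files in the answer above, with a trap for cleanup:

obsolete_files=$(mktemp)   || exit 1
files_to_keep=$(mktemp)    || exit 1
files_to_delete=$(mktemp)  || exit 1
trap 'rm -f "$obsolete_files" "$files_to_keep" "$files_to_delete"' EXIT
# ...then use these variables in place of the fixed /tmp paths above.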