path/mydir contains a list of directories. The names of these directories tell me which database they relate to.

Inside each directory is a bunch of files, but the filenames tell me nothing of importance.

I'm trying to write a command in linux bash that accomplishes the following:

  • For each directory in path/mydir, find the max timestamp of the last modified file within that directory
  • Print the last modified file's timestamp next to the parent directory's name
  • Exclude any timestamps less than 30 days old
  • Exclude specific directory names using regex
  • Order by oldest timestamp

Given this directory structure in path/mydir:

database_1
   table_1.file (last modified 2021-11-01)
   table_2.file (last modified 2021-11-01)
   table_3.file (last modified 2021-11-05)
database_2
   table_1.file (last modified 2021-05-01)
   table_2.file (last modified 2021-05-01)
   table_3.file (last modified 2021-08-01)
database_3
   table_1.file (last modified 2020-01-01)
   table_2.file (last modified 2020-01-01)
   table_3.file (last modified 2020-06-01)

I would want to output:

database_3 2020-06-01
database_2 2021-08-01

This half works, but looks at the modified date of the parent directory instead of the max timestamp of files under the directory: find . -maxdepth 1 -mtime +30 -type d -ls | grep -vE 'name1|name2'

I'm very much a novice with bash, so any help and guidance is appreciated!

ComputersAreNeat

1 Answer

Would you please try the following:

#!/bin/bash

cd "path/mydir/" || exit 1   # stop if the directory does not exist
for d in */; do
    dirname=${d%/}
    mdate=$(find "$d" -maxdepth 1 -type f -mtime +30 -printf "%TY-%Tm-%Td\t%TT\t%p\n" | sort -rk1,2 | head -n 1 | cut -f1)
    [[ -n $mdate ]] && echo -e "$mdate\t$dirname"
done | sort -k1,1 | sed -E $'s/^([^\t]+)\t(.+)/\\2 \\1/'

Output with the provided example:

database_3 2020-06-01
database_2 2021-08-01
  • for d in */; do loops over the subdirectories in path/mydir/.
  • dirname=${d%/} removes the trailing slash just for the printing purpose.
  • printf "%TY-%Tm-%Td\t%TT\t%p\n" prepends the modification date and time to the filename delimited by a tab character. The result will look like:
2021-08-01      12:34:56        database_2/table_3.file
  • sort -rk1,2 sorts the output by the date and time fields in descending order.
  • head -n 1 picks the line with the latest timestamp.
  • cut -f1 extracts the first field with the modification date.
  • [[ -n $mdate ]] skips a subdirectory when find returned no matching file, i.e. when mdate is empty.
  • sort -k1,1 just after done sorts the collected lines of all subdirectories by date.
  • sed -E ... swaps the timestamp and the dirname. It is needed only to cope with dirnames that contain a tab character. If yours do not, you can omit the sed command by switching the order of the timestamp and dirname in the echo command and changing the sort command to sort -k2,2.
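
The sed-free variant described in the last bullet might look like the sketch below. It assumes GNU find and directory names without whitespace. The function name latest_old_files and its directory argument are made up for illustration; the find options are the same as in the script above.

```shell
#!/bin/bash

# Hypothetical helper: for each subdirectory of $1, print "dirname date" for the
# newest file among those older than 30 days, sorted by date (oldest first).
# Assumes directory names contain no whitespace, so sort can split on blanks.
latest_old_files() {
    (   # subshell: the cd and the loop variables do not leak to the caller
        cd "$1" || exit 1
        for d in */; do
            dirname=${d%/}
            mdate=$(find "$d" -maxdepth 1 -type f -mtime +30 -printf "%TY-%Tm-%Td\t%TT\n" | sort -r | head -n 1 | cut -f1)
            # dirname first, date second: no swap needed afterwards
            [[ -n $mdate ]] && echo "$dirname $mdate"
        done | sort -k2,2
    )
}
```

Calling latest_old_files path/mydir would then print one "dirname date" line per matching subdirectory.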

As for the requirement "Exclude specific directory names using regex", add your own filter, for example in the find command or at the top of the loop.
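
One possible way to do that, sketched under the assumption that the names to skip can be written as a single extended regular expression, is to test each directory name with bash's =~ operator and continue on a match. The function name filter_dirs is hypothetical, and the pattern name1|name2 in the usage note is just the placeholder from the question's grep command.

```shell
#!/bin/bash

# Hypothetical helper: print the names of the subdirectories of $1 that do NOT
# match the extended regular expression $2 (unanchored, like grep -vE).
filter_dirs() {
    local dir=$1 exclude=$2 d dirname
    for d in "$dir"/*/; do
        [[ -d $d ]] || continue        # the glob matched nothing
        dirname=${d%/}                 # strip the trailing slash
        dirname=${dirname##*/}         # strip the leading path
        # skip excluded names; an empty pattern excludes nothing
        [[ -n $exclude && $dirname =~ $exclude ]] && continue
        printf '%s\n' "$dirname"
    done
}
```

For example, filter_dirs path/mydir 'name1|name2' lists every subdirectory except those whose name contains name1 or name2. Inside the script above, the equivalent would be a line such as [[ $dirname =~ $exclude ]] && continue right after dirname=${d%/}.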

[Edit]
To print a directory name only if the most recently modified file anywhere under it is older than the specified age, please try this instead:

#!/bin/bash

cd "path/mydir/" || exit 1   # stop if the directory does not exist
now=$(date +%s)
for d in */; do
    dirname=${d%/}
    read -r secs mdate < <(find "$d" -type f -printf "%T@\t%TY-%Tm-%Td\n" | sort -nrk1,1 | head -n 1)
    secs=${secs%.*}
    if (( secs < now - 3600 * 24 * 30 )); then
        echo -e "$secs\t$dirname $mdate"
    fi
done | sort -nk1,1 | cut -f2-
  • now=$(date +%s) stores the current time, as seconds since the epoch, in the variable now.
  • for d in */; do loops over the subdirectories in path/mydir/.
  • dirname=${d%/} removes the trailing slash just for the printing purpose.
  • -printf "%T@\t%TY-%Tm-%Td\n" prints the modification time as seconds since the epoch, followed by the modification date, delimited by a tab character. The result will look like:
1627743600      2021-08-01
  • sort -nrk1,1 sorts the output by the modification time in descending order.
  • head -n 1 picks the line with the latest timestamp.
  • read -r secs mdate < <( stuff ) reads the two fields of the command's output into secs and mdate, in that order.
  • secs=${secs%.*} removes the fractional part, because bash arithmetic handles integers only.
  • The condition (( secs < now - 3600 * 24 * 30 )) is true if secs is more than 30 days earlier than now.
  • echo -e "$secs\t$dirname $mdate" prints dirname and mdate, prefixed with secs as a sort key.
  • sort -nk1,1 just after done sorts the collected lines of all subdirectories numerically by secs.
  • cut -f2- removes the secs field.
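
To try the logic without touching a real file server, the same steps can be wrapped in a function and pointed at a scratch directory. This is only a sketch: the function name stale_dirs, its arguments, and the choice to skip directories that contain no files at all are assumptions rather than part of the script above. It requires GNU find for -printf.

```shell
#!/bin/bash

# Hypothetical wrapper around the steps above: list the subdirectories of $1
# whose newest file (at any depth) is more than $2 days old (default 30),
# oldest first.
stale_dirs() {
    local dir=$1 days=${2:-30} now d dirname secs mdate
    now=$(date +%s)
    for d in "$dir"/*/; do
        [[ -d $d ]] || continue          # the glob matched nothing
        dirname=${d%/}
        dirname=${dirname##*/}           # keep only the last path component
        secs="" mdate=""
        read -r secs mdate < <(find "$d" -type f -printf "%T@\t%TY-%Tm-%Td\n" | sort -nrk1,1 | head -n 1)
        [[ -n $secs ]] || continue       # no files at all: skipped (an assumption)
        secs=${secs%.*}                  # drop the fractional seconds
        if (( secs < now - 3600 * 24 * days )); then
            printf '%s\t%s %s\n' "$secs" "$dirname" "$mdate"
        fi
    done | sort -nk1,1 | cut -f2-
}
```

A quick check is to create a few directories, backdate some files with touch -d, and run stale_dirs on the parent. A directory that holds even one recently modified file, at any depth, should not be listed.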
tshiono
  • This is a great response! Not only did this work properly, but you did a great job explaining exactly what the bash script does- and I really appreciate it. How might I modify this to also look at files in any subdirectories for the max modified date? For example, if `path/mydir/database_3` contained additional sub directories instead of just files? Say we had table_1.file, table_2.file, table_3.file, folder_1->table_4.file, folder_2->table5.file? – ComputersAreNeat Nov 17 '21 at 02:18
  • Thank you for the feedback. Good to know it works. As for your additional case, assuming the `table5.file` has the max modification date, which directory name do you want to print, `database_3` or `folder_2`? – tshiono Nov 17 '21 at 02:33
  • Still database_3. All files and subdirectories under database_3 should be looked at. If the most recently modified file hasn't been modified in 30 days or more, the parent directory is returned with the timestamp of that file – ComputersAreNeat Nov 17 '21 at 02:54
  • Understood. Then please drop `-maxdepth 1` of the `find` command. It'll work as you expect. – tshiono Nov 17 '21 at 03:01
  • I thought so too, and tried that before my last comment and didn't have any luck. In a real example in my directory on the file server, it is returning the max date only from one directory level down. For example: `path/mydir/database_3/folder_2/table5.file` has a modification date of 2021-11-15. The script still returns `database_3 2020-06-01`, which was from `path/mydir/database_3/table_3.file` I'm wondering if table_3.file is being returned because table5.file's date is filtered out by the -mtime +30, and table_3.file is the next max date that is not within the 30 days – ComputersAreNeat Nov 17 '21 at 03:07
  • Exactly. As you mention, the said `table_5.file` is not included in the `find` output due to the rule `-mtime +30`. I was thinking this is the requirement, or do you expect something else? If you want to analyze the behavior of the script, modify the `mdate=` line as `find "$d" -type f -mtime +30 -printf "%TY-%Tm-%Td\t%TT\t%p\n"` and see the output of `find` command. – tshiono Nov 17 '21 at 03:35
  • Ah, maybe that's the confusion. Let me try and give a better example. I'm looking for old databases to delete. Under `path/mydir` is a list of database names as directories. If I navigate into a database directory, it lists tables and other files. I want to find databases that have not been modified within 30 days. To do this, I need to look at each file / subdirectory in each database name directory and determine if the most recently modified file is older than 30 days. This will tell me if the database itself hasn't been used in more than 30 days - and I can delete the entire database. – ComputersAreNeat Nov 17 '21 at 03:51
  • Hmm, I'm gradually getting it. Then we should unlist `database_3` because it includes the recently updated file `table_5.file`, right? – tshiono Nov 17 '21 at 04:04
  • If I'm understanding the requirements correctly, please take a look at the `[Edit]` section in my updated answer. Cheers. – tshiono Nov 17 '21 at 06:17
  • Right, database_3 would be unlisted because table_5.file has been modified within the last 30 days. With your edit, I get the following syntax error: `script.sh: line 8: ((: 1633983677.7848492010: syntax error: invalid arithmetic operator (error token is ".7848492010")` – ComputersAreNeat Nov 17 '21 at 15:16
  • Sorry for the inconvenience. The output of `-printf %T@` includes fractional part. My previous test environment is old and the `find` command returned integer only, then I forgot to remove the fractional part. Now fixed. Would you please test the updated script? BR. – tshiono Nov 17 '21 at 22:05
  • Works perfectly, thank you! – ComputersAreNeat Nov 19 '21 at 01:46
  • Thank you for testing it. Good to know it works. Cheers. – tshiono Nov 19 '21 at 01:51