I took over tech operations at a small company. The previous lead made the mistake of storing hundreds of GB of images, even though our website only ever uses around 5 GB of them, and there are no cleanup scripts. I am now tasked with optimizing this mess and am not quite sure where to start. Is there some way to get a list of the last time each image file was accessed via the web, so I can do something like "IF NOT OPENED IN LAST 365 DAYS THEN MOVE TO BACKUP DRIVE AND REMOVE FROM PRIMARY SERVER"?
2 Answers
You neglected to tell us the environment you are in (OS, web server, etc.), so I'll assume Linux.
If your data directory is not mounted with noatime, you can use find to search for files that have not been accessed for 365 days:
find /var/www/images -iname "*.jpg" -atime +365 -type f
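To act on that list rather than just print it, the same find call can feed an archive step. A rough sketch only, with placeholder paths (/var/www/images, /mnt/backup/images) and GNU cp's --parents option assumed:

# Dry run first: list what would be archived.
find /var/www/images -type f -iname "*.jpg" -atime +365 -print

# Copy to the backup mount (which must already exist), preserving the
# directory layout, then remove the original only if the copy succeeded
# (the second -exec only runs when the first one returns success).
cd /var/www/images && find . -type f -iname "*.jpg" -atime +365 \
    -exec cp --parents -p {} /mnt/backup/images/ \; \
    -exec rm {} \;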
If you used noatime, this won't be possible (and if you used relatime, the atime might be up to 24 hours off).
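You can check which options the relevant filesystem is mounted with before trusting atime; findmnt from util-linux resolves the mount that covers a given path (the path below is a placeholder):

# Look for noatime/relatime in the OPTIONS column.
findmnt -T /var/www/images -o TARGET,SOURCE,OPTIONS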
However, this is not a good approach, as you might end up with dead links in your HTML files, and someone will need that resource five days from now...
Better approach: parse your web tree, list all files that are referenced in there (make sure to turn your web server's autoindexing off...) and archive everything else. This way you can be sure that everything referenced in your HTML files will still be available.
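A minimal sketch of that comparison, assuming a webroot of /var/www/html, images stored under /var/www/images, and relative references like "images/foo.jpg" in the HTML (all placeholders you would need to adapt):

# 1. Every image path referenced anywhere in the HTML tree.
grep -rhoEi "images/[^\"' >]+\.(jpe?g|png|gif)" /var/www/html \
    | sort -u > /tmp/referenced.txt

# 2. Every image actually on disk, expressed in the same relative form.
( cd /var/www && find images -type f ) | sort > /tmp/on_disk.txt

# 3. Files on disk that no HTML page references: archive candidates.
comm -13 /tmp/referenced.txt /tmp/on_disk.txt > /tmp/unreferenced.txt

Treat /tmp/unreferenced.txt as a candidate list to review, not something to delete blindly - see the caveat below.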
Beware, there is a chance you'll have isolated islands of HTML files that aren't linked from your regular tree but that people access via direct link - think about those when building your list. Of course, the same might be true for image files, but you can really only catch those with either log-file parsing or the find method.

Depending on how far back your web logs go, you could parse out all the requests for files in the directory in question and then archive or delete everything that never appears in the logs.
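A sketch of that, assuming an Apache-style combined log under /var/log/apache2/ and images served under /images/ (both placeholders; field 7 of the combined format is the request path):

# Every image path that appears anywhere in the (possibly rotated/gzipped) logs.
zcat -f /var/log/apache2/access.log* \
    | awk '{print $7}' | sed 's/[?].*//' \
    | grep -Ei '^/images/.*\.(jpe?g|png|gif)' \
    | sort -u > /tmp/requested.txt

# Everything on disk in the same /images/... form, then the difference.
( cd /var/www && find images -type f ) | sed 's|^|/|' | sort > /tmp/on_disk.txt
comm -13 /tmp/requested.txt /tmp/on_disk.txt > /tmp/never_requested.txt

As with the HTML-parsing approach, review the resulting list before archiving anything.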
