
I recently inherited a rather large website with a gigantic, catastrophic mess of poorly named and organized images spread across multiple folders, referenced from multiple locations. I'm trying to consolidate some of the assets of the site and I need to know which of the hundreds of images are actually being used. Some of them may appear in <img> tags, others may be set as backgrounds via CSS, and still others may be created at runtime with JavaScript.

Because the images are so numerous, because there is no discernible naming convention (e.g., img-asdfasd83mmd.png), and because the version control system in place up until my arrival consisted of duplicating existing files and only slightly altering the names of the old ones (e.g., img-asdfasdfasdfasf.png, img-asdfasdfasdf2.png, img-asdfasdfasdf-version4-final.png), this task isn't as simple as a quick visual scan.

I'm looking for an automated solution that will scan the source of this website and determine which images are being used and which aren't. Anything that provides some kind of solution for site-wide renaming of assets with automatic reference updates would be nice too. Thanks!

  • We'd need some more background info - what platform are you on, what framework was used to make this old website? I'm curious too - what is meant by "automatic reference updates"? Thanks – Caffeinated Mar 27 '12 at 16:38

2 Answers


From the mess you describe, I'm assuming that no single consistent system was used to create these files in the first place. So even though there are some specialized solutions around, they're usually tied to the authoring software they're meant to support and probably not much help in your case. I'm also afraid there may not be a single fully automated solution for you; the best I can imagine is a handful of semi-automated approaches.

  1. Very first step: Take a backup (You did that already, didn't you?).
  2. Analyze what's there (this is where your question comes in)

    • If your filesystem supports it, scan the filesystem and record the last access time of every file in the webserver hierarchy. Chances are that files which were last accessed (read) at the same time they were created are backup copies of something else. Do this first of all, since your own exploring is bound to modify those dates (a sketch of such a snapshot follows this list).
    • If those webpages are mostly static, you might be able to identify many of the images that are definitely in use by pointing wget or another crawler/spider at the site and mirroring it. The images it pulls down are the most prominent targets to get organized. One of those automated sitemap generator tools could be helpful in that process as well (a small crawler sketch follows this list).
    • Some pages and images which wget might have overlooked can be identified from the webserver logs: on a unixoid OS, extract the filenames that were served (you're not interested in who asked for them, only which files were sent), sort them, run them through uniq to drop duplicates, and you have another set of files you cannot delete (a log-parsing sketch follows this list).
    • Try to deduplicate the files: find exact duplicates (for example by comparing MD5 hashes) and reduce each group to a single instance. In the geographic vicinity (filesystem-wise) you may also find near-duplicates, such as old versions set apart only by minor filename variations (a hashing sketch follows this list).
  3. Plan whether you want to weed out the existing site or recreate it.
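
As a rough sketch of that first filesystem snapshot, here is what it could look like in Python 3. The web root path and the output filename are placeholders for whatever your server actually uses:

```python
#!/usr/bin/env python3
"""Record the last-access and modification times of every file under the
web root BEFORE exploring the site, since browsing the tree will update
the access times. WEB_ROOT and REPORT are placeholders."""
import csv
import os
import time

WEB_ROOT = "/var/www/html"          # hypothetical web root
REPORT = "file_times.csv"           # where the snapshot is written

with open(REPORT, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["path", "size_bytes", "last_access", "last_modified"])
    for dirpath, _dirnames, filenames in os.walk(WEB_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)          # stat() reads metadata only, so it won't touch atime
            writer.writerow([
                path,
                st.st_size,
                time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_atime)),
                time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime)),
            ])
```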
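
For the crawling step, a minimal stdlib-only Python spider along these lines can collect image references from <img> tags, inline style attributes and linked stylesheets of a mostly static site. The start URL is a placeholder, and anything assembled at runtime with JavaScript will still slip through, so treat the result as a lower bound, not a complete list:

```python
#!/usr/bin/env python3
"""Same-host crawler that collects image references from <img> tags,
inline style attributes, and url(...) entries in linked CSS."""
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

START_URL = "http://www.example.com/"   # hypothetical entry point
IMG_EXT = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico")
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")

class LinkCollector(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.pages, self.images, self.css = set(), set(), set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.add(urljoin(self.base, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            self.pages.add(urljoin(self.base, attrs["href"]))
        elif tag == "link" and (attrs.get("href") or "").endswith(".css"):
            self.css.add(urljoin(self.base, attrs["href"]))
        if attrs.get("style"):                       # inline background images
            for ref in CSS_URL.findall(attrs["style"]):
                self.images.add(urljoin(self.base, ref))

host = urlparse(START_URL).netloc
queue, seen, used_images = deque([START_URL]), set(), set()
while queue:
    url = queue.popleft()
    if url in seen or urlparse(url).netloc != host:  # stay on the same host
        continue
    seen.add(url)
    try:
        body = urlopen(url).read().decode("utf-8", errors="replace")
    except Exception:
        continue
    if url.endswith(".css"):                         # stylesheet: only harvest url(...) refs
        used_images.update(urljoin(url, ref) for ref in CSS_URL.findall(body))
        continue
    parser = LinkCollector(url)
    parser.feed(body)
    used_images.update(parser.images)
    queue.extend(parser.pages)
    queue.extend(parser.css)

for img in sorted(used_images):
    if img.lower().endswith(IMG_EXT):
        print(img)
```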
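
For the log step, this is roughly the Python equivalent of filtering, sorting and uniq-ing the served filenames by hand. It assumes an Apache/nginx combined-format access log at a placeholder path:

```python
#!/usr/bin/env python3
"""Pull every image filename that was actually served out of a
combined-format access log. LOG_FILE is a placeholder."""
import re

LOG_FILE = "access.log"              # hypothetical log location
IMG_EXT = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico")
# The request line is quoted, e.g. "GET /img/x.png HTTP/1.1"
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

served = set()
with open(LOG_FILE, errors="replace") as fh:
    for line in fh:
        match = REQUEST.search(line)
        if not match:
            continue
        path = match.group(1).split("?", 1)[0]      # drop query strings
        if path.lower().endswith(IMG_EXT):
            served.add(path)

for path in sorted(served):                          # sorted, duplicates already collapsed
    print(path)
```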
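
And for the deduplication step, a sketch that groups files by MD5 hash so exact duplicates can be collapsed to a single instance (again, the web root path is a placeholder). Near-duplicates with slightly different content still need eyeballing:

```python
#!/usr/bin/env python3
"""Group files under the web root by MD5 hash and report exact duplicates."""
import hashlib
import os
from collections import defaultdict

WEB_ROOT = "/var/www/html"           # hypothetical web root

def md5_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large images don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

by_hash = defaultdict(list)
for dirpath, _dirnames, filenames in os.walk(WEB_ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        by_hash[md5_of(path)].append(path)

for digest, paths in sorted(by_hash.items()):
    if len(paths) > 1:                               # only report duplicate groups
        print(digest)
        for path in sorted(paths):
            print("   ", path)
```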

All in all, the more you need to weed out, the more time will go into this project. Draw a line once you have an idea what you're up against, and decide whether it wouldn't be more economical to rework the entire site, migrating only what is needed into a clear structure.

Tatjana Heuser

You could try a tool like A1 Website Analyzer. It will show you all images and the pages they are linked and/or used from. (However, it will not tell you about orphan image files, i.e. images that are neither used nor linked from anywhere.)

Tom