Optimizing a Mass ID3 Tag Scan

Question

I'm building a small tool to scan a music collection, read each track's ID3 info, and store it as long as that particular artist does not have a song that has been accessed more than twice. I'm planning on using Mutagen for reading the tags.

However, my music collection, like many others', is massive, exceeding 20,000 songs. As far as I know, libraries like Mutagen have to open and close every file to read its ID3 info. While reading an MP3's tags isn't terribly expensive, that's a lot of files. I'm already planning a minor optimization: keeping a count of songs per artist and not storing any info once an artist's count exceeds 2. As far as I can tell, though, I still need to open every song to check the artist ID3 tag.

I toyed with the idea of using directories as a hint for the artist name and not reading any more info in that directory once the artist song count exceeds 2, but not everyone has their music set up in neat Artist/Album/Songs directories.

Does anyone have any other optimizations in mind that might cut down on the overhead of opening so many MP3s?

score 1 · Accepted Answer · answered Mar 10 '13 at 17:26

Beware of premature optimization. Are you really sure that this will be a performance problem? What are your requirements -- how quickly does the script need to run? How fast does it run with the naïve approach? Profile and evaluate before you optimize. I think there's a serious possibility that you're seeing a performance problem where none actually exists.
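A minimal sketch of how you might measure the naïve approach before deciding anything. The names `time_scan` and `read_tags` are made up for illustration; `read_tags` stands for whatever function you use to read one file's tags (for example a small wrapper around Mutagen).

```python
import time
from typing import Callable, Iterable


def time_scan(paths: Iterable[str], read_tags: Callable[[str], object]) -> tuple[int, float]:
    """Call read_tags on every path and return (file count, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for path in paths:
        read_tags(path)  # hypothetical reader supplied by the caller
        count += 1
    return count, time.perf_counter() - start
```

Run it once over the whole collection and once over a small sample. If 20,000 files finish in a time you can live with, there is nothing to optimize.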

You can't avoid visiting each file once if you want a guaranteed correct answer. As you've seen, optimizations that entirely skip files will basically amount to automated guesswork.

Can you keep a record of previous scans you've done, and on a subsequent scan use the last-modified dates of the files to avoid re-scanning files you've already scanned once? This could mean that your first scan might take a little bit of time, but subsequent scans would be faster.
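Here is a rough sketch of that idea, assuming the record of previous scans is a JSON file that maps each path to its last-modified time and tags. The function names and the cache layout are my own invention, and `read_tags` is again a placeholder for your real tag reader.

```python
import json
import os
from typing import Callable


def load_cache(cache_path: str) -> dict:
    """Load the previous scan results, or start empty if there are none."""
    try:
        with open(cache_path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def scan(root: str, cache: dict, read_tags: Callable[[str], dict]) -> dict:
    """Return an updated cache, calling read_tags only for new or modified files."""
    updated = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".mp3"):
                continue
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime
            entry = cache.get(path)
            if entry is not None and entry["mtime"] == mtime:
                updated[path] = entry  # unchanged since last scan: reuse stored tags
            else:
                updated[path] = {"mtime": mtime, "tags": read_tags(path)}
    return updated


def save_cache(cache_path: str, cache: dict) -> None:
    """Write the scan results so the next run can skip unchanged files."""
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)
```

Note that this still walks the directory tree and stats every file on each run; it only avoids opening and parsing files whose modification time has not changed. Files that were deleted simply drop out of the updated cache.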

If you need to do a lot of complex queries on a music collection quickly, consider importing the metadata of the entire collection into a database (for instance SQLite or MySQL). The initial import will take time, and each later update to insert new files will take a little time as well (checking the last-modified dates as above). Once the data is in your database, however, everything should be fairly snappy, assuming that the database is set up sensibly.
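As an illustration, this is roughly what the SQLite route could look like with Python's built-in `sqlite3` module. The table layout and the helper names are assumptions of mine, not anything prescribed; adapt them to the fields you actually care about.

```python
import sqlite3


def open_db(db_path: str) -> sqlite3.Connection:
    """Open (or create) the database and make sure the assumed schema exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracks (
               path   TEXT PRIMARY KEY,
               mtime  REAL NOT NULL,
               artist TEXT,
               title  TEXT
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_tracks_artist ON tracks(artist)")
    return conn


def upsert_track(conn: sqlite3.Connection, path: str, mtime: float, artist: str, title: str) -> None:
    """Insert a track, or overwrite the stored row if the path is already known."""
    conn.execute(
        "INSERT OR REPLACE INTO tracks (path, mtime, artist, title) VALUES (?, ?, ?, ?)",
        (path, mtime, artist, title),
    )


def artists_with_at_most(conn: sqlite3.Connection, limit: int) -> list:
    """Return (artist, song count) pairs for artists with no more than `limit` songs."""
    rows = conn.execute(
        "SELECT artist, COUNT(*) FROM tracks "
        "GROUP BY artist HAVING COUNT(*) <= ? ORDER BY artist",
        (limit,),
    )
    return rows.fetchall()
```

When importing many files, call `conn.commit()` once per batch rather than once per file; committing after every row tends to dominate the import time.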

Hitting the database AND reading from disk is a lot of I/O. Once it's in it's in, but getting it in there can take a long time. — Makoto, Mar 10 '13 at 17:28

score 1 · Answer 2 · answered Mar 10 '13 at 17:34

In general, for this problem I would recommend using multiple ways of detecting an artist or track title:

1st way to check: Is the filename maybe in ARTIST-TITLE.mp3 format? (or similar)
(filename for this would be "Artist-Track.mp3")

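import os
import re

for file in os.listdir(PATH_TO_MP3s):
    # Split once on "_", "-" or "."; "Artist-Track.mp3" -> ["Artist", "Track", "mp3"]
    parts = re.split(r"[_\-.]", file)
    if len(parts) >= 3:  # skip names that are too short to match the pattern
        artist, track, filetype = parts[-3:]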

Of course, you have to check that the file is actually in that format first.

2nd step (if first doesn't fit for that file) would be checking if the directory names fit (like you said)

3rd and last one would be to check the ID3 tags.

But make sure to check that the values are right before trusting them.
For example, if someone used "Track-Artist.mp3", the code I provided would switch artist and track.
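Put together, the three checks could be chained like this. It is only a sketch: the filename pattern, the assumed root/Artist/Album/Song.mp3 layout, and all function names are my own assumptions, and `read_id3_artist` is a placeholder for your real tag reader. The first two steps are guesses, so the "Track-Artist.mp3" caveat above still applies.

```python
import os
import re
from typing import Callable, Optional

# Assumed pattern: "Artist-Track.mp3" or "Artist_Track.mp3" with exactly one separator.
FILENAME_PATTERN = re.compile(
    r"^(?P<artist>[^_\-]+)[_\-](?P<track>[^_\-]+)\.mp3$", re.IGNORECASE
)


def artist_from_filename(path: str) -> Optional[str]:
    """Return the artist part of the filename, or None if the name does not fit."""
    match = FILENAME_PATTERN.match(os.path.basename(path))
    return match.group("artist").strip() if match else None


def artist_from_directory(path: str, root: str) -> Optional[str]:
    """Assume root/Artist/Album/Song.mp3; return None for any other depth."""
    parts = os.path.relpath(path, root).split(os.sep)
    return parts[0] if len(parts) == 3 else None


def guess_artist(path: str, root: str, read_id3_artist: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the cheap guesses first and only open the file as a last resort."""
    return (
        artist_from_filename(path)
        or artist_from_directory(path, root)
        or read_id3_artist(path)  # hypothetical reader supplied by the caller
    )
```

Because `or` short-circuits, the file is only opened when neither the filename nor the directory layout yields a candidate.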
