2

I have a collection of media files, mostly music, most of them having been imported from CD many years ago. This collection has been transferred between different media players, different filesystems, different computers, etc, many times. In that process, some tracks have been accidentally duplicated. I'm also constantly trying to curate the metadata on these and get everything properly tagged, since when much of it was originally imported, I did not have fancy media playback software and did not even realize that the ID3 tags indicated that everything was just "Track %d" on the classic album "Album".

This creates a situation where I have some files with up-to-date metadata, but "duplicates" of the same media file that I'd like to delete, whose metadata has not been properly updated. Since the metadata is present within the file, the contents of these files now differ and tools like liten2 don't work.

My question is: is there a library I can use that will conveniently extract a uniquely identifying fingerprint (probably a cryptographic hash of some kind, but that's not a hard requirement) of the media content only of the file, ignoring the metadata? If so, how do I use it?

Glyph
  • http://musicbrainz.org/ Saved me on this one.. manually just start fixing your albums/artists one by one.. It has a pretty cool thing when you're done upon save it moves them to another directory with the correct tree structure..! Not really programming thing but an idea to save your music! – Lipis Dec 10 '12 at 00:13
  • To those that closed the question: I don't understand why the question was closed and I'd like to reopen it. As far as I can tell, the question is really specific and clear: the input is an MP3 file, the output is a hash that ignores metadata and thus will be the same for a media file who has had its metadata altered but not its media. – Glyph Dec 26 '12 at 21:19
  • @Glyph: At a brief glance the question reads a bit like "tell me what library to use" and/or "please solve my very specific problem for me". I don't think that's actually the case (and have voted to reopen), but there are SO. MANY. BAD. QUESTIONS. that people tend to get trigger-happy with the close votes (myself included). – C. A. McCann Dec 26 '12 at 21:43
  • @Glyph I was one of them.. but reading it again.. McCann is right.. :) Sorry for that.. – Lipis Dec 26 '12 at 22:39
  • Thanks for the reopen votes, folks. I appreciate you coming back to take a second look :). – Glyph Dec 28 '12 at 01:51
  • I just ran across the `afprint` tool today on OS X, and I'm wondering if this will do what I want... – Glyph Jun 28 '15 at 00:10

3 Answers

4

Echoprint is one free way to fingerprint audio by its content - i.e. it doesn't depend on metadata, nor on byte-exact data matches. Their FAQ has an entry "I want to deduplicate a big collection".

I think the core of it is not itself Python but a web API, though they do provide the pyechonest library.

Dan Stowell
  • I'd prefer an API that I could use locally, but this is totally a valid answer to my question :). – Glyph Dec 10 '12 at 17:52
  • Aand I'm un-accepting the answer now since this did go exactly the way non-local APIs often do, and has now been discontinued. – Glyph May 11 '20 at 05:31
3

You will probably need to dive a bit into the file format specifications of your audio files (mp3, avi, mpg, ogg, etc.). For mp3, this means discarding all ID3v2 metadata chunks: identify the chunks inside the file that actually encode audio information, then hash those chunks for comparison. Bear in mind that if you have two files of the same track in different formats, they will not be recognized as the same file. Likewise, if you have the same track twice in the same format but with, e.g., different bitrates, they won't be identical either.

Hyperboreus
  • Mostly I'm dealing with files which were duplicated and then had their metadata edited, so something that could hash the exact bits of the encoded media without worrying about the audio would be good enough. The question, though, is about whether a library exists to do this for me... – Glyph Dec 26 '12 at 21:20
-1

How about (temporarily) converting the files to WAV format and comparing their hashes? The ID3 tags should be stripped off in the process. There are plenty of tools to do that, and embedding this procedure in a script should not be too difficult.
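A script along those lines might look like the sketch below (again, not from the original answer). It assumes `ffmpeg` is installed and on the PATH for the decoding step; the function names are hypothetical. Rather than hashing the whole WAV file, it hashes only the sample frames via the standard `wave` module, since a converter may write its own metadata into the WAV header.

```python
import hashlib
import subprocess
import wave


def decode_to_wav(src: str, dst: str) -> None:
    """Decode src to 16-bit PCM WAV at dst (assumes ffmpeg is available)."""
    subprocess.run(
        ["ffmpeg", "-nostdin", "-loglevel", "error", "-y",
         "-i", src, "-vn", "-map_metadata", "-1",
         "-c:a", "pcm_s16le", dst],
        check=True,
    )


def pcm_digest(wav_path: str) -> str:
    """SHA-256 over the sample format and the raw frames of a WAV file."""
    h = hashlib.sha256()
    with wave.open(wav_path, "rb") as w:
        # Mix in the format so identical bytes with a different layout differ.
        fmt = (w.getnchannels(), w.getsampwidth(), w.getframerate())
        h.update(repr(fmt).encode("ascii"))
        while True:
            frames = w.readframes(65536)
            if not frames:
                break
            h.update(frames)
    return h.hexdigest()
```

Usage would be `decode_to_wav("track.mp3", "tmp.wav")` followed by `pcm_digest("tmp.wav")`, deleting the temporary file afterwards. This only helps for copies of the same encoded file: two different encodes of the same track will generally decode to different samples.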