As a personal project (to learn Python better), I started working on a duplicate file remover (especially for .mp3 files, since I thought of it while trying to organise my full-of-duplicates music collection). I'm fairly clear on how to proceed: match file names and offer for deletion only those with a similarity ratio above 0.7, and use MD5 sums for files that are identical but have completely different names (e.g. "metallica-nothing else matters" and "Track1"). The problem is that I don't know what to do about files that have different names and also differ slightly in content. For example, "nothing else matters" and "Track1" are the same except that "Track1" has 2 seconds of silence at the end. My question is: is there some kind of method or algorithm that checks similarity between the files themselves? Something like string matching, but on files? It doesn't matter if it's a complicated algorithm; the harder the better, since I'm doing this only to learn :D
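The two checks described above could be sketched roughly as follows. This is only a minimal illustration: it assumes the 0.7 ratio comes from the standard library's `difflib.SequenceMatcher`, and the helper names `name_similarity` and `md5_of_file` are made up for this example.

```python
import hashlib
from difflib import SequenceMatcher
from pathlib import Path


def name_similarity(a, b):
    """Return a ratio in [0, 1] comparing two file names.

    Case and the file extension are ignored (an assumption of this sketch).
    """
    stem_a = Path(a).stem.lower()
    stem_b = Path(b).stem.lower()
    return SequenceMatcher(None, stem_a, stem_b).ratio()


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Pairs with `name_similarity(...) > 0.7` would be offered for deletion, and files with equal `md5_of_file(...)` values are byte-for-byte identical regardless of name. Neither check catches the "2 seconds of silence" case, since any change to the audio data changes the MD5 sum.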
- Matching MP3 files based on similarity strikes me as a highly non-trivial task. If you find a library that can do the matching out of the box, then use it in your project. If you can't, then I'd encourage you to pick a different project for learning the language. – NPE May 31 '12 at 13:30
- To do this in a meaningful way, you will probably need to decode the MP3 files and then do some rather involved statistical analysis, possibly including the cross-correlation of the Fourier-transformed sound data. If you really pursue this project, you will learn a lot about statistics and little about Python. – Sven Marnach May 31 '12 at 13:32
- Well, I'm a computer science student looking at 4 months of free time (summer vacation), so I kinda want to engage in a non-trivial task, if it can be done. I was thinking about using MATLAB/Mathematica to get the sonograms or the noise levels (I'm not really familiar with sound manipulation terms), then plot them and compare the resulting graphs, but then again this might be time- and memory-consuming. – cpp_ninja May 31 '12 at 13:40
- Related: http://stackoverflow.com/a/551006/4279 – jfs May 31 '12 at 13:42
- Similar questions: http://stackoverflow.com/q/3172911/222914 and http://stackoverflow.com/q/476227/222914 – Janne Karila May 31 '12 at 15:57
2 Answers
4
You could use Chromaprint, which computes a fingerprint for a piece of music. It should be able to find similar music files.
If you want to push this further, you could use the MusicBrainz API to find the exact information about a piece of music.
These libraries are used in two great music library tagging and sorting applications I use: Picard and beets.
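To give an idea of how two fingerprints might be compared once you have them, here is a rough sketch. It assumes each fingerprint is available in raw form as a list of 32-bit integers (for example from Chromaprint's `fpcalc -raw` tool), and it uses a simple bit-agreement score with a small alignment search. The function names and the `max_offset` and `min_overlap` parameters are made up for illustration; this is not the matching algorithm used by Chromaprint or AcoustID themselves.

```python
def hamming_similarity(fp_a, fp_b, min_overlap=1):
    """Fraction of equal bits over the overlapping prefix of two raw fingerprints.

    fp_a and fp_b are sequences of 32-bit integers. Returns 0.0 when the
    overlap is shorter than min_overlap items.
    """
    n = min(len(fp_a), len(fp_b))
    if n < min_overlap or n == 0:
        return 0.0
    # XOR leaves a 1 bit wherever the two values differ.
    differing = sum(
        bin((x ^ y) & 0xFFFFFFFF).count("1") for x, y in zip(fp_a, fp_b)
    )
    return 1.0 - differing / (32 * n)


def best_similarity(fp_a, fp_b, max_offset=20, min_overlap=1):
    """Best score over small shifts of either fingerprint.

    Shifting allows for extra material (such as silence) at the start of
    one file; extra material at the end is ignored by the prefix overlap.
    """
    best = 0.0
    for offset in range(max_offset + 1):
        best = max(
            best,
            hamming_similarity(fp_a[offset:], fp_b, min_overlap),
            hamming_similarity(fp_a, fp_b[offset:], min_overlap),
        )
    return best
```

With this kind of score, unrelated random bit patterns agree on about half their bits, so a useful duplicate threshold would sit well above 0.5; the exact value is something you would have to tune on your own collection.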

madjar