The OP's problem can be summarized as follows:
In the generalized audio stream of a video, try to detect "music" versus "everything else".
Where "music" is not likely to exist in fingerprint databases.
And where "everything else" in this context must include:
- speech
- silence
- synthetic sounds
- Foley sounds (explosions, gunshots, footfalls, etc.)
We must also assume that the audio soundtrack of a generalized video is highly processed with echo, reverb, multichannel panning, etc.
In the general video case, all of the above audio elements would be mixed together into the final audio, making the problem domain absolutely immense.
This is a very challenging problem, with most likely no simple or robust solution.
In support of this premise: even a general music classifier (let's call it MuCLAS), where the unknown music sample is guaranteed to be a member of the classifier's training set, is a very difficult problem, due to the significant expense involved in creating the training set and in building and tuning the classifier index.
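To make the classifier side of this concrete, here is a minimal illustrative sketch of one low-level feature such a system might compute: spectral flatness, which tends to be low for tonal (often musical) audio and high for noise-like audio. This is a hypothetical toy, not the OP's solution, and nowhere near a real MuCLAS; a production system would combine many features over long windows with a trained model.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-10):
    """Geometric mean / arithmetic mean of the power spectrum.
    Near 1.0 for noise-like audio, near 0.0 for tonal audio."""
    power = np.abs(np.fft.rfft(frame)) ** 2 + eps
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def frame_features(signal, frame_len=2048, hop=1024):
    """Per-frame spectral flatness for a mono signal."""
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([spectral_flatness(f * window) for f in frames])

# Toy comparison: a steady 440 Hz tone (tonal) vs. white noise
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).standard_normal(sr)

# The tone's flatness should be far below the noise's flatness
assert frame_features(tone).mean() < frame_features(noise).mean()
```

Even this trivial feature illustrates the core difficulty: a heavily mixed soundtrack (speech over music over Foley) lands somewhere in the middle of the flatness range, and no single threshold separates the classes.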
But the OP's problem domain is much larger than the MuCLAS problem domain, due to the much higher entropy of the OP's unknown data set. This implies much higher complexity and cost, relative to MuCLAS.
Another argument supporting the above premise is that the state of the art in general speech recognition assumes, and insists upon, much lower entropy in the unknown data set than the entropy implied by the OP's data set.
And speech recognition is one of the best funded problems in the general field of autonomous pattern recognition.