I am trying to measure the "loudness" of various clips (ranging from about 2 to 40 seconds) of TV content. I'm interested in the relative loudness of the content: which scenes have people shouting vs. whispering, loud music vs. quiet scenes, etc.
I think this means I want to capture the gain (input loudness) rather than the volume (output loudness).
I have tried two methods with Python:
librosa's RMS:
np.mean(librosa.feature.rms(S=spectrogram, center=True).T, axis=0)
pyloudnorm (which implements the ITU-R BS.1770-4 loudness algorithm, reported in LUFS):
meter = pyln.Meter(samplerate)
loudness = meter.integrated_loudness(waveform)
When I compare the results of the two, they sometimes agree but often differ: the same clip can show a relatively high RMS but low loudness, and vice versa. More importantly, while both appear to get some things right, neither seems to be a very accurate representation of what is coming out of the TV. Is there some step I need to take to filter out frequencies that are not perceived but still affect these metrics, or am I just missing something major?
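One thing that may matter for the comparison: librosa's RMS is a linear amplitude, while LUFS is a logarithmic (dB-style) unit computed after K-weighting and gating, so the raw numbers are not on the same scale. Below is a minimal sketch, assuming float samples in [-1, 1], of putting plain RMS on a dB scale first. The helper name `rms_dbfs` and the test tone are made up for illustration, and the sketch does not apply the BS.1770 K-weighting or gating, so it will not reproduce LUFS values.

```python
import math


def rms_dbfs(samples, eps=1e-12):
    """Return the RMS level of float samples in [-1, 1], in dB relative to full scale.

    Hypothetical helper for illustration only: no K-weighting, no gating.
    """
    mean_square = sum(x * x for x in samples) / len(samples)
    # 10*log10(mean square) equals 20*log10(RMS); eps avoids log(0) on silence.
    return 10.0 * math.log10(max(mean_square, eps))


# Assumed test signal: 1 second of a full-scale 1 kHz sine at 48 kHz.
sample_rate = 48000
sine = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(sample_rate)]
quiet = [0.1 * x for x in sine]  # same tone with amplitude scaled by 0.1

level_loud = rms_dbfs(sine)
level_quiet = rms_dbfs(quiet)
print(f"loud:  {level_loud:.2f} dBFS")
print(f"quiet: {level_quiet:.2f} dBFS")
```

On a dB scale, scaling the amplitude by 0.1 lowers the level by 20 dB, which is the kind of difference a LUFS reading would also report for the same scaling, whereas the linear RMS value drops by a factor of 10.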