-2

Is there any way to measure the quality and the appeal (aesthetics) of an audio clip? By quality I mean how good the sound is, i.e., the lower the noise, the better the quality. Appeal, by contrast, measures how pleasing the sound is to a human listener. There is existing work on image quality and aesthetic assessment, such as NIMA, but I have not found an equivalent for sound/audio. Any methods or references would be helpful.

Saikat
  • 1
    Appeal is necessarily tied to the current vibes of the listener, so any metric for a given clip could only be measured, if at all, after assessing the person. Even if that were possible, the resulting metric would only be valid for that moment of listening; it would be wildly invalid at other times, and at all times for other people. – Scott Stensland Nov 17 '19 at 15:24

2 Answers

2

We recently published a PESQ variant for the PyTorch framework. You can find it here: https://github.com/audiolabs/torch-pesq

This allows you to use a perceptual metric for wideband speech quality in the context of deep learning, and to generate gradients from it for training.

bytesnake
1

Audio quality and aesthetics are measured both with and without machine learning. However, most of the work focuses on speech reproduction, and much less on general audio.

One can conduct listening tests, where a panel of human assessors listens to audio and gives scores, to establish a Mean Opinion Score (MOS). There exist several standards for conducting these, such as MUSHRA (ITU-R BS.1534). Such subjective scores form the basis for developing "objective metrics", which are algorithmic ways to estimate the perceived quality of the audio. Some early examples are PESQ for speech quality (ITU-T P.862, a standard since 2001) and PEAQ for audio quality (ITU-R BS.1387, a standard since 1998). More advanced ones include POLQA (ITU-T P.863, a standard since 2011) and ViSQOLAudio (proposed in research).
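To make the MOS idea concrete, here is a minimal sketch of how the ratings for one clip could be aggregated. The function name `mean_opinion_score`, the 1 to 5 rating scale, and the normal-approximation 95% confidence interval are my own assumptions for illustration; the listening-test standards prescribe their own screening and statistics.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for the ratings given to one clip.

    Assumes each rating is a number on a 1-5 opinion scale (an assumption
    for this sketch, not something a specific standard is quoted on).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a spread")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% confidence interval on the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for a single clip.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean matters in practice, since a small panel can produce a MOS that looks precise but is not.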

In recent years, several papers have shown that one can learn such metrics using deep neural networks. For speech quality, one recent paper (2019) is "Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network".

The only learned evaluation I have found for general Audio Quality or music quality is Fréchet Audio Distance.
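The core of Fréchet Audio Distance is the Fréchet distance between two Gaussians fitted to embeddings of a reference set and an evaluated set. Below is a rough NumPy sketch of just that distance. The function name `frechet_distance` is hypothetical, and the random arrays only stand in for real embeddings (the FAD paper extracts them with a VGGish model, which is not reproduced here).

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, n_dims), with n_dims >= 2.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of the product, which avoids an explicit matrix sqrt.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in embeddings: random vectors instead of real audio embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
degraded = reference + 0.5  # same spread, shifted mean
print(frechet_distance(reference, reference))
print(frechet_distance(reference, degraded))
```

A lower value means the evaluated set is statistically closer to the reference set in embedding space. Note that this compares whole collections of clips, so it does not give a score for a single clip on its own.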

Jon Nordby