
I am doing song genre classification. I have chopped each song into small frames (5 s) and generated a spectrogram per frame as the input features for a neural network; each frame carries the genre label of its song.

The data looks like the following:

   name         label   feature
   ....
   song_i_frame1 label   feature_vector_frame1
   song_i_frame2 label   feature_vector_frame2
   ...
   song_i_framek label   feature_vector_framek
   ...

I can get a prediction accuracy for each frame from Keras with no problem. But currently I do not know how to aggregate the predictions from frame level to song level with majority voting, since the frame names are lost when the data is fed into the Keras model.

How can I retain the name of each frame (for example, song_i_frame1) alongside the Keras outputs, so that I can aggregate the frame predictions into a song-level prediction via majority voting? Or are there other methods to aggregate to a song-level prediction?

I googled around but cannot find an answer to this and would appreciate any insight.
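To make the goal concrete, this is the kind of aggregation I have in mind — a sketch with made-up frame names and predicted classes (the real names would come from my data above; `model.predict` keeps row order, so a list of names parallel to the input frames should be enough):

```python
from collections import Counter, defaultdict

# Made-up frame names and per-frame predicted classes (argmax of each
# frame's output), in the same order the frames were fed to the model
names = ['song_1_frame1', 'song_1_frame2', 'song_1_frame3', 'song_2_frame1']
pred_classes = [3, 3, 0, 1]

# Group the frame predictions back by song name
by_song = defaultdict(list)
for name, c in zip(names, pred_classes):
    song = name.rsplit('_frame', 1)[0]
    by_song[song].append(c)

# Majority vote per song
song_pred = {song: Counter(cs).most_common(1)[0][0]
             for song, cs in by_song.items()}
print(song_pred)  # {'song_1': 3, 'song_2': 1}
```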

lll
  • Are your original labels per song? And then each frame gets the label from the song it is part of? – Jon Nordby Jan 23 '19 at 12:59
  • yes. each frame gets the label from the song it is part of and I want to find a way to aggregate these predictions to song level – lll Mar 20 '19 at 17:16

1 Answer


In the dataset each label might be named (ex: 'rock'). To use this with a neural network, this needs to be transformed to an integer (ex: 2), and then to a one-hot-encoding (ex: [0,0,1]). So 'rock' == 2 == [0,0,1]. Your output predictions will be in this one-hot-encoded form. [ 0.1, 0.1, 0.9 ] means that class 2 was predicted, [ 0.9, 0.1, 0.1 ] means class 0 etc. To do this in a reversible way, use sklearn.preprocessing.LabelBinarizer.
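As a minimal illustration of that round trip (pure Python rather than sklearn; note that LabelBinarizer sorts the class names alphabetically, which determines which column belongs to which genre):

```python
# Label -> integer -> one-hot, and back again via argmax
classes = sorted(['rock', 'jazz', 'blues', 'metal'])  # ['blues', 'jazz', 'metal', 'rock']
to_int = {c: i for i, c in enumerate(classes)}

def one_hot(label):
    vec = [0] * len(classes)
    vec[to_int[label]] = 1
    return vec

def decode(scores):
    # Inverse transform: pick the class with the highest score
    return classes[max(range(len(scores)), key=scores.__getitem__)]

print(one_hot('rock'))               # [0, 0, 0, 1]
print(decode([0.1, 0.1, 0.9, 0.2]))  # metal
```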

There are several ways of combining frame-predictions into an overall prediction. The most common are, in increasing order of complexity:

  • Majority voting on probabilities
  • Mean/average voting on probabilities
  • Averaging on log-odds of probabilities
  • Sequence model on log-odds of probabilities
  • Multiple-Instance Learning

Below is an example of the first three.

import numpy
from sklearn.preprocessing import LabelBinarizer

labels = [ 'rock', 'jazz', 'blues', 'metal' ] 

binarizer = LabelBinarizer()
y = binarizer.fit_transform(labels)

print('labels\n', '\n'.join(labels))
print('y\n', y)

# Outputs from frame-based classifier. 
# input would be all the frames in one song
# frame_predictions = model.predict(frames)
frame_predictions = numpy.array([
    [ 0.5, 0.2, 0.3, 0.9 ],
    [ 0.9, 0.2, 0.3, 0.3 ],
    [ 0.5, 0.2, 0.3, 0.7 ],
    [ 0.1, 0.2, 0.3, 0.5 ],
    [ 0.9, 0.2, 0.3, 0.4 ],
])

def vote_majority(p):
    # minlength keeps the vote vector at n_classes,
    # even if some classes never win a frame
    voted = numpy.bincount(numpy.argmax(p, axis=1), minlength=p.shape[1])
    normalized = voted / p.shape[0]
    return normalized

def vote_average(p):
    return numpy.mean(p, axis=0)

def vote_average_logits(p):
    logits = numpy.log(p / (1 - p))
    avg = numpy.mean(logits, axis=0)  # average over frames, not classes
    p = 1/(1 + numpy.exp(-avg))
    return p


maj = vote_majority(frame_predictions)
mean = vote_average(frame_predictions)
mean_logits = vote_average_logits(frame_predictions)

genre_maj = binarizer.inverse_transform(numpy.array([maj]))
genre_mean = binarizer.inverse_transform(numpy.array([mean]))
genre_mean_logits = binarizer.inverse_transform(numpy.array([mean_logits]))
print('majority voting', maj, genre_maj)
print('mean voting', mean, genre_mean)
print('mean logits voting', mean_logits, genre_mean_logits)

Output

labels
 rock
jazz
blues
metal
y
 [[0 0 0 1]
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]]
majority voting [0.4 0.  0.  0.6] ['rock']
mean voting [0.58 0.2  0.3  0.56] ['blues']
mean logits voting [0.60812676 0.2        0.3        0.58864142] ['blues']

A simple improvement over averaging probabilities is to compute the logit (log-odds) of each probability and average those instead. This more properly accounts for predictions that are very confident (probabilities close to 0 or 1). It can be seen as a form of Naive Bayes: computing the posterior probability under the assumption that the frames are independent.
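A small numeric illustration of the difference (plain Python, made-up frame probabilities): one very confident frame shifts the logit average far more than it shifts the plain probability average.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# One very confident frame (0.99) and one mildly negative frame (0.4)
frames = [0.99, 0.4]

mean_prob = sum(frames) / len(frames)                              # 0.695
mean_logit = sigmoid(sum(logit(p) for p in frames) / len(frames))  # ~0.89

# The confident frame dominates the logit average, as it should:
# a single probability of 0.99 is strong evidence on its own.
```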

One can also perform voting by using a classifier trained on the frame-wise predictions, though this is not so commonly done and is complicated when the input length varies. A simple sequence model can be used, e.g. a Recurrent Neural Network (RNN) or a Hidden Markov Model (HMM).

Another alternative is to use Multiple-Instance Learning with GlobalAveragePooling over the frame-based classifications, to learn on whole songs at once.
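A forward-pass sketch of that idea (numpy only, not a trainable Keras model; the scores are made up): each frame gets a score vector, a GlobalAveragePooling step averages them into one song-level vector, and the training loss would then be computed against the song label rather than per frame.

```python
import numpy

# Hypothetical pre-softmax scores from the frame-level model, one row per frame
frame_scores = numpy.array([
    [2.0, 0.1, 0.3, 1.5],
    [1.8, 0.2, 0.1, 0.9],
    [1.2, 0.4, 0.2, 1.1],
])

# GlobalAveragePooling over the frame axis -> one score vector per song
song_scores = frame_scores.mean(axis=0)

# Softmax to get a single song-level class distribution; during training,
# the loss (e.g. cross-entropy against the song label) is taken here
exp = numpy.exp(song_scores - song_scores.max())
song_probs = exp / exp.sum()

print(song_probs.argmax())  # class 0 wins for these scores
```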

Jon Nordby
  • thanks for this! But I still have a confusion, i want to make the loss function to minimize the aggregated prediction (e.g., by majority voting) of each song rather than the loss at the frame level. Is this a case of multiple instance learning? – lll Mar 20 '19 at 22:21
  • Yep, that would be multiple instance learning! Open a new question about it, link it here, and I will answer it :) – Jon Nordby Mar 20 '19 at 22:37
  • thanks! the other question is over there: https://stackoverflow.com/questions/55272508/keras-how-to-write-customized-loss-function-to-aggregate-over-frame-level-predi – lll Mar 21 '19 at 01:29