
I am trying to find the instances in a source audio file, taken from a badminton match, where a shot was hit by either of the players. For that purpose, I have marked the timestamps with positive (hit sound) and negative (no hit sound: commentary, crowd sound, etc.) labels like so:

shot_timestamps = [0,6.5,8, 11, 18.5, 23, 27, 29, 32, 37, 43.5, 47.5, 52, 55.5, 63, 66, 68, 72, 75, 79, 94.5, 96, 99, 105, 122, 115, 118.5, 122, 126, 130.5, 134, 140, 144, 147, 154, 158, 164, 174.5, 183, 186, 190, 199, 238, 250, 253, 261, 267, 269, 270, 274] 
shot_labels = ['no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no','no','no', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'yes', 'yes', 'no', 'no', 'yes', 'yes', 'no'] 

I have been taking 1 second windows around these timestamps like so:

import math
from scipy.io import wavfile

rate, source = wavfile.read(source)  # 'source' starts as the wav path, then holds the sample array

def get_audio_snippets(shot_timestamps):

    shot_snippets = []  # collection of all audio snippets at the timestamps above

    for timestamp in shot_timestamps:
        start = math.ceil(timestamp * rate)
        end = math.ceil((timestamp + 1) * rate)

        # clamp indices that run past the end of the recording
        if start >= source.shape[0]:
            start = source.shape[0] - 1
        if end >= source.shape[0]:
            end = source.shape[0] - 1

        shot_snippets.append(source[start:end])

    return shot_snippets

and converting each snippet to a spectrogram image for the model (roughly as sketched below). The model doesn't seem to be learning anything; accuracy stays around 50%. What can I do to improve the model?
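A simplified sketch of that conversion step (using librosa here for illustration; the parameter values are illustrative, the exact settings are in the linked repo):

# Illustrative sketch: 1-second snippet -> log-mel spectrogram
# (librosa assumed; n_mels / hop_length are example values, not necessarily the repo's settings)
import numpy as np
import librosa

def snippet_to_logmel(snippet, rate, n_mels=64, hop_length=512):
    y = snippet.astype(np.float32)
    if y.ndim > 1:                      # stereo -> mono
        y = y.mean(axis=1)
    y /= (np.abs(y).max() + 1e-9)       # scale to [-1, 1]
    mel = librosa.feature.melspectrogram(y=y, sr=rate, n_mels=n_mels, hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)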

Edit:

The audio file: Google Drive

The timestamp labels: Google Drive

Code: Github

These timestamps were made recently and haven't been used in the code above, as I don't know exactly what window sizes to take for labelling purposes. The annotation file above contains all the timestamps at which shots are hit.

PS: Also added this on Data Science Stackexchange as recommended: https://datascience.stackexchange.com/q/116629/98765

ChaoS Adm
  • How are you doing the spectrogram conversion? How does the data look, when you plot spectrograms (say 10 of each) for class yes/no? – Jon Nordby Nov 28 '22 at 15:24
  • How does the model look, and training done? – Jon Nordby Nov 28 '22 at 15:24
  • Can you provide the audio file matching the annotations? – Jon Nordby Nov 28 '22 at 15:24
  • @JonNordby thanks for your time. I have updated the question with most of the information you asked for here. The exact code can be found in file (3.1) in the Github repository. – ChaoS Adm Dec 01 '22 at 06:12
  • I have updated my answer below to provide a full SED implementation using your data – Jon Nordby Dec 11 '22 at 17:14
  • @JonNordby Thanks so much for the entire implementation. It cleared up a lot of things for me. How did you obtain `test_start` and `val_start`? They're not defined anywhere in the notebook but are used. Can you update the notebook to include the definition of those variables? – ChaoS Adm Jan 11 '23 at 05:25
  • Those are the times (in seconds) when the test and validation splits start in the audio file. I have updated the Github now to calculate them. Must have accidentally deleted them before – Jon Nordby Jan 11 '23 at 19:10

1 Answer


Detecting when a particular sound happens is known as Sound Event Detection (SED). There is a wide range of approaches to this topic, as it has been actively researched for many decades.

Your existing solution, using correlation in the waveform domain with some template sounds, is unlikely to work well for this task, because the amount of variation between badminton shot sounds in a match is likely to be quite high.

The recommended approach is to collect a small dataset and use supervised learning to train a detector. For example, take data from 20 different matches (preferably with different recording setups etc.), and annotate the shots within selected time periods, so that you get at least 50 shots from each match.

Sound Event Detection using deep-learning

A description of a modern deep-learning approach can be found in Sound Event Detection: A Tutorial. It describes the pieces that are needed:

  • Audio preprocessing using log-scaled mel spectrograms
  • Splitting the spectrogram into fixed-length overlapping windows
  • A model architecture using a Convolutional Recurrent Neural Network (CRNN)
  • Using a time-series (event activations) as the output/target of the neural network (see the sketch after this list)
  • Post-processing the continuous event activations into discrete events
  • Evaluating model performance using event-based metrics
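To make the target construction concrete, here is a minimal sketch (not the notebook's exact getXY code) of how annotated event times could be turned into a per-frame binary activation target; time_resolution and event_length are illustrative values:

import numpy as np

def events_to_frame_targets(event_times, total_duration, time_resolution=0.050, event_length=0.2):
    """Mark frames that overlap an annotated event as 1, everything else as 0.
    event_times: event onset times in seconds.
    time_resolution: duration of one spectrogram frame in seconds (illustrative)."""
    n_frames = int(np.ceil(total_duration / time_resolution))
    target = np.zeros(n_frames, dtype=np.float32)
    for t in event_times:
        start = int(t / time_resolution)
        end = int(np.ceil((t + event_length) / time_resolution))
        target[start:min(end, n_frames)] = 1.0
    return target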

A complete implementation of this, using the audio and labels for the match that you have annotated can be found in this notebook.

I reproduce some of the key code here, for posterity.

SEDNet model

def build_sednet(input_shape, filters=128, cnn_pooling=(5, 2, 2), rnn_units=(32, 32),
                 dense_units=(32,), n_classes=1, dropout=0.5):
    """
    SEDnet type model
    Based on https://github.com/sharathadavanne/sed-crnn/blob/master/sed.py
    """
    from tensorflow.keras import Model
    from tensorflow.keras.layers import Input, Bidirectional, Conv2D, BatchNormalization, Activation, \
            Dense, MaxPooling2D, Dropout, Permute, Reshape, GRU, TimeDistributed

    spec_start = Input(shape=(input_shape[-3], input_shape[-2], input_shape[-1]))
    spec_x = spec_start

    # CNN blocks: extract local time-frequency features, pooling only along the second (frequency) axis
    for i, pool in enumerate(cnn_pooling):
        spec_x = Conv2D(filters=filters, kernel_size=(3, 3), padding='same')(spec_x)
        spec_x = BatchNormalization(axis=1)(spec_x)
        spec_x = Activation('relu')(spec_x)
        spec_x = MaxPooling2D(pool_size=(1, pool))(spec_x)
        spec_x = Dropout(dropout)(spec_x)
    spec_x = Permute((2, 1, 3))(spec_x)
    spec_x = Reshape((input_shape[-3], -1))(spec_x)

    # Bidirectional GRU layers: model temporal context across frames
    for units in rnn_units:
        spec_x = Bidirectional(
            GRU(units, activation='tanh', dropout=dropout, recurrent_dropout=dropout, return_sequences=True),
            merge_mode='mul')(spec_x)

    # Frame-wise dense layers and sigmoid output: per-frame event probability
    for units in dense_units:
        spec_x = TimeDistributed(Dense(units))(spec_x)
        spec_x = Dropout(dropout)(spec_x)
    spec_x = TimeDistributed(Dense(n_classes))(spec_x)

    out = Activation('sigmoid', name='strong_out')(spec_x)
    model = Model(inputs=spec_start, outputs=out)
    return model

Start with a low-complexity model that has a modest number of parameters.

model = build_sednet(input_shape, n_classes=1,
                     filters=10,
                     cnn_pooling=[2, 2, 2],
                     rnn_units=[5, 5],
                     dense_units=[16],
                     dropout=0.1)

Splitting input into windows

import numpy

def compute_windows(arr, frames, pad_value=0.0, overlap=0.5, step=None):
    """Split a (bands, frames) spectrogram into fixed-length, overlapping windows."""
    if step is None:
        step = int(frames * (1 - overlap))

    windows = []

    width, length = arr.shape

    for start_idx in range(0, length, step):
        end_idx = min(start_idx + frames, length)

        # create an empty window of fixed length, padded with pad_value
        win = numpy.full((width, frames), pad_value, dtype=float)
        # fill with data
        win[:, 0:end_idx-start_idx] = arr[:, start_idx:end_idx]

        windows.append(win)

    return windows
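For example, splitting a spectrogram with 64 mel bands and 1000 frames into windows of 72 frames with 50% overlap (shapes are illustrative, using dummy data):

import numpy

spec = numpy.random.randn(64, 1000)              # (mel bands, frames), dummy data
wins = compute_windows(spec, frames=72, overlap=0.5)
print(len(wins), wins[0].shape)                  # 28 windows, each of shape (64, 72)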

Training

This is done in the standard fashion for a Keras model, roughly as sketched below.
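A minimal sketch, assuming X_train/Y_train and X_val/Y_val hold the windowed spectrograms and per-frame targets from the preprocessing above (loss, optimizer and epochs here are illustrative; the exact setup is in the notebook):

# Illustrative training setup; X_train/Y_train and X_val/Y_val are assumed to come
# from the windowing and target construction shown earlier
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(
    X_train, Y_train,
    validation_data=(X_val, Y_val),
    batch_size=8,
    epochs=100,
)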

Using trained model

To get the event predictions we need to:

  • Split the spectrogram into windows
  • Run the model on all windows
  • Merge the predictions from the windows

Here is the key code for that.

import numpy
import pandas

def merge_overlapped_predictions(window_predictions, window_hop):

    # flatten the predictions from overlapped windows
    predictions = []
    for win_no, win_pred in enumerate(window_predictions):
        win_start = window_hop * win_no
        for frame_no, p in enumerate(win_pred):
            s = {
                'frame': win_start + frame_no,
                'probability': p,
            }

            predictions.append(s)

    df = pandas.DataFrame.from_records(predictions)
    df['time'] = pandas.to_timedelta(df['frame'] * time_resolution, unit='s')
    df = df.drop(columns=['frame'])

    # merge predictions from multiple windows
    out = df.groupby('time').median()
    return out

def predict_spectrogram(model, spec):

    # prepare input data. NOTE: must match the training preparation in getXY
    # (time_resolution, window_length and Xm, the normalization offset,
    # are defined during preprocessing in the notebook)
    window_hop = 1
    wins = compute_windows(spec, frames=window_length, step=window_hop)
    X = numpy.expand_dims(numpy.stack([(w - Xm).T for w in wins]), -1)

    # make predictions on windows
    y = numpy.squeeze(model.predict(X, verbose=False))

    out = merge_overlapped_predictions(y, window_hop=window_hop)
    return out
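Usage then looks roughly like this, with spec being the log-mel spectrogram of the full recording, prepared the same way as during training:

pred = predict_spectrogram(model, spec)   # DataFrame of per-frame probabilities indexed by time
pred['probability'].plot()                # quick visual check against the annotations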

Results

Here are the results when trained on the first 3.5 minutes of audio, and then using the last 1.5 minutes as validation + test.

(Plots of the predicted event activations for the validation and test segments, compared against the annotated ground truth.)

The annotated ground truth is shown in green, and the output predictions in blue. A threshold of around 0.3 would work better than the 0.5 shown here.
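To turn the continuous activation curve into discrete events (the post-processing step mentioned earlier), a minimal sketch could look like this; the threshold and minimum-gap values are illustrative and worth tuning:

import numpy as np

def activations_to_events(probabilities, times, threshold=0.3, min_gap=0.1):
    """Turn a per-frame probability curve into (start, end) event intervals (times in seconds).
    threshold and min_gap are tunable; the values here are only a starting point."""
    active = np.asarray(probabilities) >= threshold
    times = np.asarray(times)

    # collect contiguous runs of frames above the threshold
    events = []
    start = None
    for t, a in zip(times, active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            events.append((start, t))
            start = None
    if start is not None:
        events.append((start, times[-1]))

    # merge events separated by a very short gap
    merged = []
    for s, e in events:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# e.g. activations_to_events(pred['probability'].to_numpy(),
#                            pred.index.total_seconds().to_numpy())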

The event-wise F1 score for val/test is around 0.75. But with training data from multiple matches I expect this to improve greatly.

Jon Nordby
  • So you're essentially recommending setting up a sort of CNN with spectrogram images as inputs and the manual annotations as labels for training purposes? Thereafter using this model to extract all the timestamps in a particular match? – ChaoS Adm Nov 17 '22 at 16:56
  • Yep, that is a good general approach. The CNN should process short time-windows, enough to contain the event of interest and not much more. And the label would be whether an event exists inside this window. – Jon Nordby Nov 18 '22 at 10:53
  • I have one final question: even though you suggest taking 50 shots from each match, I would have to take more time-windows, and also capture time windows where the event does not occur, for training purposes as well, right? – ChaoS Adm Nov 19 '22 at 04:56
  • Yes, you need the "negative" data as well. So choose some long-ish time periods (maybe 5 minutes), and go through all of that. Mark all events of interest in that period. Then any time in that section which does not have an annotation is implicitly "no event". Splitting into windows should not be done during labelling, but rather during training. – Jon Nordby Nov 21 '22 at 12:40
  • I tried doing this but I am hardly getting an accuracy of 50%. Any ideas on how I can boost accuracy? I have been taking 1 second windows around the timestamp where the event occurs and converting that to spectrogram images for the model. The model doesn't seem to be learning anything. I can make my code available if it helps! – ChaoS Adm Nov 26 '22 at 18:48
  • Make a new question in Datascience stack exchange (since it is not strictly a programming question, what SO is for) - and then link it here. – Jon Nordby Nov 27 '22 at 18:49
  • I have added the link for the same in the post. – ChaoS Adm Dec 01 '22 at 06:19
  • After the previous comment, the answer was improved a lot, with a complete solution that reaches 0.75 F1 score. – Jon Nordby Apr 14 '23 at 14:51