Detecting when a particular sound happens is known as Sound Event Detection (SED). There is a wide range of approaches to this topic, as it has been actively researched for many decades.
Your existing solution, using correlation in the waveform domain against some template sounds, is unlikely to work well for this task, because the variation between badminton shot sounds in a match is likely to be quite high.
The recommended approach is to collect a small dataset and use supervised learning to train a detector. For example, take data from 20 different matches (preferably with different recording setups etc.), and annotate each shot with its time-period, to get at least 50 shots from each match.
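For the annotation itself, any tool that can export event start/end times will do. For example, Audacity label tracks export as tab-separated text, which is easy to load with pandas (the file name below is just illustrative):

```python
import pandas

# Audacity label track export: one "start<TAB>end<TAB>label" row per shot
labels = pandas.read_csv('match01-labels.txt', sep='\t',
                         names=['start', 'end', 'annotation'])
```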
Sound Event Detection using deep-learning
A description of a modern deep-learning approach can be found in Sound Event Detection: A Tutorial. It describes the pieces that are needed:
- Audio preprocessing using log-scaled mel spectrograms (a minimal sketch follows after this list)
- Splitting the spectrogram into fixed-length overlapping windows
- A model architecture using a Convolutional Recurrent Neural Network (CRNN)
- Using a time-series (event activations) as the output/target of the neural network
- Post-processing the continuous event activations into discrete events
- Evaluating model performance using event-based metrics
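For reference, the preprocessing step can look something like this with librosa. This is a minimal sketch, not code from the notebook: the sample rate, number of mel bands and hop length are assumptions you would tune for your recordings. The hop length also determines the time_resolution (seconds per frame) used later when merging window predictions.

```python
import numpy
import librosa

def compute_log_mel(path, sr=16000, n_mels=32, hop_length=512):
    # load audio at a fixed sample rate
    audio, sr = librosa.load(path, sr=sr)
    # mel spectrogram, then log-scale (dB)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                         n_mels=n_mels, hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=numpy.max)
    # duration of one spectrogram frame, in seconds
    time_resolution = hop_length / sr
    return log_mel, time_resolution
```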
A complete implementation of this, using the audio and labels for the match that you have annotated can be found in this notebook.
I reproduce some of the key code here, for posterity.
SEDNet model
```python
def build_sednet(input_shape, filters=128, cnn_pooling=(5, 2, 2), rnn_units=(32, 32),
                 dense_units=(32,), n_classes=1, dropout=0.5):
    """
    SEDnet type model

    Based on https://github.com/sharathadavanne/sed-crnn/blob/master/sed.py
    """
    from tensorflow.keras import Model
    from tensorflow.keras.layers import Input, Bidirectional, Conv2D, BatchNormalization, \
        Activation, Dense, MaxPooling2D, Dropout, Permute, Reshape, GRU, TimeDistributed

    spec_start = Input(shape=(input_shape[-3], input_shape[-2], input_shape[-1]))
    spec_x = spec_start

    # convolutional blocks, pooling along the frequency axis only
    for i, pool in enumerate(cnn_pooling):
        spec_x = Conv2D(filters=filters, kernel_size=(3, 3), padding='same')(spec_x)
        spec_x = BatchNormalization(axis=1)(spec_x)
        spec_x = Activation('relu')(spec_x)
        spec_x = MaxPooling2D(pool_size=(1, pool))(spec_x)
        spec_x = Dropout(dropout)(spec_x)
    spec_x = Permute((2, 1, 3))(spec_x)
    spec_x = Reshape((input_shape[-3], -1))(spec_x)

    # bidirectional recurrent layers over the time axis
    for units in rnn_units:
        spec_x = Bidirectional(
            GRU(units, activation='tanh', dropout=dropout, recurrent_dropout=dropout,
                return_sequences=True),
            merge_mode='mul')(spec_x)

    # time-distributed dense layers, ending in one sigmoid activation per frame and class
    for units in dense_units:
        spec_x = TimeDistributed(Dense(units))(spec_x)
        spec_x = Dropout(dropout)(spec_x)

    spec_x = TimeDistributed(Dense(n_classes))(spec_x)
    out = Activation('sigmoid', name='strong_out')(spec_x)

    model = Model(inputs=spec_start, outputs=out)
    return model
```
Try first with a low-complexity model with a modest number of parameters.
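build_sednet takes input_shape, the shape of one spectrogram window as (frames, mel bands, channels). The values below are hypothetical, chosen to match the preprocessing sketch above; adjust them to your own settings.

```python
# assuming 32 mel bands and windows of 72 spectrogram frames
# (about 2.3 seconds at 16 kHz with a hop of 512 samples)
window_length = 72
n_mels = 32
input_shape = (window_length, n_mels, 1)
```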
```python
model = build_sednet(input_shape, n_classes=1,
                     filters=10,
                     cnn_pooling=[2, 2, 2],
                     rnn_units=[5, 5],
                     dense_units=[16],
                     dropout=0.1)
```
Splitting input into windows
```python
import numpy

def compute_windows(arr, frames, pad_value=0.0, overlap=0.5, step=None):
    if step is None:
        step = int(frames * (1 - overlap))

    windows = []
    width, length = arr.shape

    for start_idx in range(0, length, step):
        end_idx = min(start_idx + frames, length)

        # create empty window, padded out to the full length
        win = numpy.full((width, frames), pad_value, dtype=float)
        # fill with data
        win[:, 0:end_idx-start_idx] = arr[:, start_idx:end_idx]

        windows.append(win)

    return windows
```
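The training pairs then consist of these spectrogram windows (X) and matching windows of the frame-wise event activations (Y). The exact preparation lives in the notebook's getXY; below is only a rough sketch of the idea, where labels is the annotation table, Xm the mean of the training spectrograms, and window_length the window size in frames, all assumptions on my part.

```python
def events_to_activations(labels, n_frames, time_resolution):
    # frame-wise binary target: 1 while a shot is sounding, 0 otherwise
    target = numpy.zeros(n_frames)
    for _, row in labels.iterrows():
        start = int(row['start'] / time_resolution)
        end = int(numpy.ceil(row['end'] / time_resolution))
        target[start:end] = 1.0
    return target

# windowing the spectrogram and the target the same way keeps them aligned
target = events_to_activations(labels, spec.shape[1], time_resolution)
spec_wins = compute_windows(spec, frames=window_length)
target_wins = compute_windows(target[numpy.newaxis, :], frames=window_length)

# normalize with the training-set mean (Xm) and add a channels axis
X = numpy.expand_dims(numpy.stack([(w - Xm).T for w in spec_wins]), -1)
Y = numpy.expand_dims(numpy.stack([w[0] for w in target_wins]), -1)
```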
Training
Is done in the standard fashion for a Keras model.
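Concretely, something along these lines. The optimizer settings, batch size and epoch count are just reasonable defaults, not taken from the notebook, and X_train/Y_train etc. are the windowed data from above, split by time.

```python
from tensorflow.keras.optimizers import Adam

# per-frame binary targets with sigmoid outputs -> binary crossentropy
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.001),
              metrics=['accuracy'])

history = model.fit(X_train, Y_train,
                    validation_data=(X_val, Y_val),
                    batch_size=8,
                    epochs=100)
```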
Using trained model
To get the event predictions we need to:
- Split the spectrogram into windows
- Run the model on all windows
- Merge the predictions from the windows
Here is the key code for that.
```python
import pandas

def merge_overlapped_predictions(window_predictions, window_hop):
    # flatten the predictions from overlapped windows
    predictions = []
    for win_no, win_pred in enumerate(window_predictions):
        win_start = window_hop * win_no
        for frame_no, p in enumerate(win_pred):
            s = {
                'frame': win_start + frame_no,
                'probability': p,
            }
            predictions.append(s)

    df = pandas.DataFrame.from_records(predictions)
    # time_resolution is the duration of one spectrogram frame, in seconds
    df['time'] = pandas.to_timedelta(df['frame'] * time_resolution, unit='s')
    df = df.drop(columns=['frame'])

    # merge predictions from multiple windows
    out = df.groupby('time').median()
    return out

def predict_spectrogram(model, spec):
    # prepare input data. NOTE: must match the training preparation in getXY
    window_hop = 1
    wins = compute_windows(spec, frames=window_length, step=window_hop)
    # normalize with the training-set mean Xm and add a channels axis
    X = numpy.expand_dims(numpy.stack([(w - Xm).T for w in wins]), -1)

    # make predictions on windows
    y = numpy.squeeze(model.predict(X, verbose=False))

    out = merge_overlapped_predictions(y, window_hop=window_hop)
    return out
```
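The remaining post-processing step, turning the continuous activations into discrete events, can be as simple as thresholding and grouping consecutive active frames. A minimal sketch, operating on the output of predict_spectrogram:

```python
def activations_to_events(pred, threshold=0.5):
    # pred: DataFrame indexed by time with a 'probability' column,
    # as returned by merge_overlapped_predictions
    active = pred['probability'] >= threshold
    events = []
    start = None
    for t, is_active in active.items():
        if is_active and start is None:
            start = t
        elif not is_active and start is not None:
            events.append({'event_onset': start.total_seconds(),
                           'event_offset': t.total_seconds(),
                           'event_label': 'shot'})
            start = None
    if start is not None:
        # event still active at the end of the recording
        events.append({'event_onset': start.total_seconds(),
                       'event_offset': active.index[-1].total_seconds(),
                       'event_label': 'shot'})
    return events
```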
Results
Here are the results when trained on the first 3.5 minutes of audio, and then using the last 1.5 minutes as validation + test.


The annotated ground truth is shown in green, and the output predictions in blue. A threshold of around 0.3 would work better than the 0.5 shown here.
The event-wise F1 score for validation/test is around 0.75, but with training data from multiple matches I expect this to improve considerably.
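The event-wise scores can be computed with the sed_eval package, which implements the standard event-based metrics. A sketch, where annotated_events and predicted_events are lists of event dicts in the same format as produced above, and the 200 ms onset collar is a common choice rather than anything specific to this problem:

```python
import sed_eval
import dcase_util

# sed_eval expects event dicts with 'event_onset', 'event_offset',
# 'event_label' and a 'file' field
reference = dcase_util.containers.MetaDataContainer(
    [dict(e, file='match01.wav') for e in annotated_events])
estimated = dcase_util.containers.MetaDataContainer(
    [dict(e, file='match01.wav') for e in predicted_events])

event_based = sed_eval.sound_event.EventBasedMetrics(
    event_label_list=['shot'],
    t_collar=0.200,
)
event_based.evaluate(reference_event_list=reference,
                     estimated_event_list=estimated)
print(event_based.results_overall_metrics()['f_measure'])
```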