I'm trying to develop a model that can detect, over time, whether the trigger word ('hello') occurs in an audio file. I used some ideas from Andrew Ng's Coursera course, but in my case something doesn't work.
I've built a model:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 3937, 129) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 981, 196) 379456
_________________________________________________________________
batch_normalization_1 (Batch (None, 981, 196) 784
_________________________________________________________________
activation_1 (Activation) (None, 981, 196) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 981, 196) 0
_________________________________________________________________
gru_1 (GRU) (None, 981, 128) 124800
_________________________________________________________________
dropout_2 (Dropout) (None, 981, 128) 0
_________________________________________________________________
batch_normalization_2 (Batch (None, 981, 128) 512
_________________________________________________________________
gru_2 (GRU) (None, 981, 128) 98688
_________________________________________________________________
dropout_3 (Dropout) (None, 981, 128) 0
_________________________________________________________________
batch_normalization_3 (Batch (None, 981, 128) 512
_________________________________________________________________
dropout_4 (Dropout) (None, 981, 128) 0
_________________________________________________________________
time_distributed_1 (TimeDist (None, 981, 1) 129
=================================================================
Total params: 604,881
Trainable params: 603,977
Non-trainable params: 904
_________________________________________________________________
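For completeness, the model definition looks roughly like this. The Conv1D kernel size and stride are reconstructed from the output shapes and parameter counts above, and the activation and dropout rates are placeholders, since the summary doesn't show them:
from keras.models import Model
from keras.layers import Input, Conv1D, BatchNormalization, Activation, Dropout, GRU, TimeDistributed, Dense

def build_model(input_shape=(3937, 129)):
    X_input = Input(shape=input_shape)

    # kernel_size=15, strides=4 reproduce the (None, 981, 196) shape and 379,456 params above
    X = Conv1D(196, kernel_size=15, strides=4)(X_input)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)    # activation type is not visible in the summary
    X = Dropout(0.8)(X)          # dropout rates are placeholders, not visible in the summary

    X = GRU(128, return_sequences=True)(X)
    X = Dropout(0.8)(X)
    X = BatchNormalization()(X)

    X = GRU(128, return_sequences=True)(X)
    X = Dropout(0.8)(X)
    X = BatchNormalization()(X)
    X = Dropout(0.8)(X)

    # one sigmoid output per timestep
    X = TimeDistributed(Dense(1, activation='sigmoid'))(X)
    return Model(inputs=X_input, outputs=X)

model = build_model()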
I created the dataset myself, with 3937 examples, transforming each audio file into its spectrogram, so:
Input - the spectrogram of the audio file,
Output - a time vector with values between 0 and 1.
The time vector initially has 10000 timesteps, but to make it fit the model's output I downsampled it to 981 timesteps.
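The downsampling step looks roughly like this (simplified - my real code may bin the labels slightly differently, but the idea is the same):
import numpy as np

def downsample_labels(y_full, Ty=981):
    # y_full: label vector at the original resolution (10000 values between 0 and 1)
    # returns Ty values, taking the maximum inside each bin so short positive spans are not lost
    edges = np.linspace(0, len(y_full), Ty + 1).astype(int)
    return np.array([y_full[edges[i]:edges[i + 1]].max() for i in range(Ty)])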
To train I used this piece of code:
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
# keep only the best model so far, judged by validation loss
mcp_save = ModelCheckpoint('model-{epoch:03d}-{acc:03f}-{val_acc:03f}.h5', save_best_only=True, monitor='val_loss', mode='auto')
model.fit(X, Y, batch_size=8, epochs=150, validation_split=0.2, callbacks=[mcp_save])
The accuracy increased over the first 25 epochs and reached about 90%. After that it got stuck - acc didn't change much, and neither did the loss. Whenever val_acc was reported, it was around 99%.
I stopped training after the 40th epoch and tested the model on an example it hadn't seen before. The Y vector (label) for this example should look like this:
https://i.stack.imgur.com/OHY8d.png
and I received the result:
https://i.stack.imgur.com/zAv1r.png
In this case the audio file contains 4 words, but only one of them is the trigger word (the second one).
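For reference, the prediction above was produced more or less like this (the checkpoint name and compute_spectrogram are placeholders standing in for my actual file and preprocessing function):
import numpy as np
from keras.models import load_model

model = load_model('model-040-0.900000-0.990000.h5')        # placeholder checkpoint name
x = compute_spectrogram('test_clip.wav')                    # placeholder; same preprocessing as the training data, shape (3937, 129)
preds = model.predict(np.expand_dims(x, axis=0))[0, :, 0]   # shape (981,), one value per output timestep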
I don't really understand why my model only gives values between 0 and 0.4. I tried other examples and it was the same. What's more, I would like to know how to invert the result, so that it has the highest values right after the trigger word is heard, not the lowest. And last but not least - what can I do to get the model to actually learn to recognize this particular word?
I should also mention that I tried training the model with a larger batch_size and with ReduceLROnPlateau, and I also evaluated it on examples from the training set - the result was still the same, so I don't think it's an overfitting issue.
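For completeness, the ReduceLROnPlateau attempt was wired in along these lines (the factor, patience and batch size below are placeholders, not the exact values from those runs):
from keras.callbacks import ReduceLROnPlateau

# lower the learning rate when val_loss stops improving
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)
model.fit(X, Y, batch_size=32, epochs=150, validation_split=0.2, callbacks=[mcp_save, reduce_lr])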
Any ideas how to fix it? Thanks in advance :)