
I'm trying to develop a model that can recognize, from an audio file, whether and when the trigger word ('hello') occurs. I used some ideas from Andrew Ng's Coursera course, but in my case something doesn't work.

I've built a model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 3937, 129)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 981, 196)          379456    
_________________________________________________________________
batch_normalization_1 (Batch (None, 981, 196)          784       
_________________________________________________________________
activation_1 (Activation)    (None, 981, 196)          0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 981, 196)          0         
_________________________________________________________________
gru_1 (GRU)                  (None, 981, 128)          124800    
_________________________________________________________________
dropout_2 (Dropout)          (None, 981, 128)          0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 981, 128)          512       
_________________________________________________________________
gru_2 (GRU)                  (None, 981, 128)          98688     
_________________________________________________________________
dropout_3 (Dropout)          (None, 981, 128)          0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 981, 128)          512       
_________________________________________________________________
dropout_4 (Dropout)          (None, 981, 128)          0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 981, 1)            129       
=================================================================
Total params: 604,881
Trainable params: 603,977
Non-trainable params: 904
_________________________________________________________________
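For reference, the model is built along these lines (a simplified sketch reconstructed from the summary above; the kernel size and strides follow from the parameter/shape counts, but the dropout rates shown are placeholders, not necessarily my exact values):

from keras.models import Model
from keras.layers import Input, Conv1D, BatchNormalization, Activation, Dropout, GRU, TimeDistributed, Dense

def build_model(input_shape=(3937, 129)):
    X_input = Input(shape=input_shape)

    X = Conv1D(196, kernel_size=15, strides=4)(X_input)      # (None, 981, 196)
    X = BatchNormalization()(X)
    X = Activation('relu')(X)
    X = Dropout(0.8)(X)

    X = GRU(128, return_sequences=True)(X)                   # (None, 981, 128)
    X = Dropout(0.8)(X)
    X = BatchNormalization()(X)

    X = GRU(128, return_sequences=True)(X)                   # (None, 981, 128)
    X = Dropout(0.8)(X)
    X = BatchNormalization()(X)
    X = Dropout(0.8)(X)

    X = TimeDistributed(Dense(1, activation='sigmoid'))(X)   # (None, 981, 1)

    return Model(inputs=X_input, outputs=X)

model = build_model()
model.summary()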

I created the dataset myself (3937 examples) and transformed each audio file into its spectrogram, so:

Input - spectrogram of the audio file,

Output - time vector with values from 0 to 1.

The time vector initially has 10,000 timestamps, but to make it fit the model's output shape I downsampled it to 981 timestamps.
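The downsampling is done roughly along these lines (a simplified sketch; the exact binning in my code may differ):

import numpy as np

def downsample_labels(y, n_out=981):
    # y: binary NumPy label vector of length 10000;
    # an output step is 1 if any original timestamp in its bin is 1.
    edges = np.linspace(0, len(y), n_out + 1).astype(int)
    return np.array([y[edges[i]:edges[i + 1]].max() for i in range(n_out)])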

To train the model I used this piece of code:

from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint

# X holds the spectrograms, Y the downsampled label vectors
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
mcp_save = ModelCheckpoint('model-{epoch:03d}-{acc:03f}-{val_acc:03f}.h5', save_best_only=True, monitor='val_loss', mode='auto')
model.fit(X, Y, batch_size=8, epochs=150, validation_split=0.2, callbacks=[mcp_save])

Accuracy increased over the first 25 epochs and reached about 90%. After that it got stuck - acc didn't change much, and neither did the loss. val_acc, when it was reported, was about 99%.

I stopped training after the 40th epoch and tested the model on an example it hadn't seen before. The Y vector (label) for this example should be:

https://i.stack.imgur.com/OHY8d.png

and I received the result:

https://i.stack.imgur.com/zAv1r.png

In this case the audio file contains 4 words, but only one of them (the second) is the trigger word.
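For reference, the prediction plotted above is produced roughly like this (a simplified sketch; x_spec is just an illustrative name for the spectrogram of the test clip):

import numpy as np
import matplotlib.pyplot as plt

# x_spec: spectrogram of the test clip, shape (3937, 129)
pred = model.predict(np.expand_dims(x_spec, axis=0))[0, :, 0]  # shape (981,), values in [0, 1]
plt.plot(pred)
plt.show()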

I don't really understand why my model only gives values between 0 and 0.4. I tried other examples and it was the same. What's more, I would like to know how to invert the result, so that it has the highest values after hearing the trigger word, not the lowest. And last but not least - what can I do to get the model to recognize this particular word?

I should also mention that I tried training the model with a larger batch_size and with ReduceLROnPlateau (see the sketch below), and I also evaluated it on examples from the training set - the result was still the same, so I don't think it's an overfitting issue.
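The ReduceLROnPlateau attempt looked roughly like this (the factor, patience and batch size here are illustrative, not necessarily the exact values I used):

from keras.callbacks import ReduceLROnPlateau

# reduce the learning rate when val_loss stops improving
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6)
model.fit(X, Y, batch_size=32, epochs=150, validation_split=0.2, callbacks=[mcp_save, reduce_lr])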

Any ideas how to fix it? Thanks in advance :)

Comments:
  • We'd need some code to help you, probably both for training and prediction. If it's too much to paste here, consider linking the repo. – Lukasz Tracewski Jun 11 '19 at 05:53
  • https://codebunk.com/b/236343264/ - here is part of the training code. I should mention that my results are already "inverted", but the prediction is still not very good - it still has difficulty predicting this particular word, and the predictions are always below 0.5 – Ania Jun 22 '19 at 19:10
