
I trained an end-to-end (E2E) speech recognition model with a Conformer encoder and a Transformer decoder using hybrid CTC/attention, but it cannot correctly recognize even the training data.

I trained this model for about 20 epochs, but the CTC loss did not drop below about 2.5. This was the case even when I changed lambda. The cross-entropy loss went down to about 0.5, but autoregressive recognition of the training data still produced nearly identical outputs for every utterance.

Why doesn't the CTC loss go down?

Why does the decoder output almost the same result for every input?

Since the CE branch also produces only similar outputs, I suspect the encoder is failing to learn, but I don't know the cause. If anyone knows more about this, please let me know.

I've uploaded the detailed code here.

Model

The model structure is as follows: [model image]

Visualization

If the log-mel spectrogram's shape is (T=160, 80), the target shape is (number of characters = 30,), the mask shape is (30, 30), and the padding mask shape is (30,), then the summary is as follows.
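For reference, a summary like the one below can be produced with torchinfo (a minimal sketch; the batch size of 2 is an assumption matching the shapes shown, and everything is kept on one device for simplicity):

import torch
from torchinfo import summary

model = AVSRwithConf2(numClasses=41)
x = torch.randn(2, 160, 80)                       # (B, T=160, 80) log-mel batch
tgt = torch.randint(1, 39, (2, 30))               # (B, S=30) character indices
tgt_mask = torch.triu(torch.full((30, 30), float("-inf")), diagonal=1)  # causal mask
tgt_padding_mask = torch.zeros(2, 30, dtype=torch.bool)                 # no padding
summary(model, input_data=[x, tgt, tgt_mask, tgt_padding_mask])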

===================================================================================================================
Layer (type:depth-idx)                                            Output Shape              Param #
===================================================================================================================
AVSRwithConf2                                                     --                        --
├─ModuleList: 1-1                                                 --                        --
├─TransformerDecoder: 1                                           --                        --
│    └─ModuleList: 2-1                                            --                        --
├─Sequential: 1-2                                                 [2, 256, 39, 19]          --
│    └─Conv2d: 2-2                                                [2, 256, 79, 39]          2,560
│    └─ReLU: 2-3                                                  [2, 256, 79, 39]          --
│    └─Conv2d: 2-4                                                [2, 256, 39, 19]          590,080
│    └─ReLU: 2-5                                                  [2, 256, 39, 19]          --
├─Sequential: 1-3                                                 [2, 39, 256]              --
│    └─Linear: 2-6                                                [2, 39, 256]              --
│    │    └─Linear: 3-1                                           [2, 39, 256]              1,245,440
├─PositionalEncoding: 1-4                                         [2, 39, 256]              --
├─Sequential: 1                                                   --                        --
│    └─PositionalEncoding: 2-7                                    [2, 39, 256]              --
│    └─Dropout: 2-8                                               [2, 39, 256]              --
├─Embedding: 1-5                                                  [2, 30, 256]              10,496
├─PositionalEncoding: 1-6                                         [30, 2, 256]              --
├─Sequential: 1                                                   --                        --
│    └─PositionalEncoding: 2-9                                    [30, 2, 256]              --
├─ModuleList: 1-1                                                 --                        --
│    └─ConformerBlock: 2-10                                       [2, 39, 256]              --
│    │    └─Sequential: 3-2                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-11                                       [2, 39, 256]              --
│    │    └─Sequential: 3-3                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-12                                       [2, 39, 256]              --
│    │    └─Sequential: 3-4                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-13                                       [2, 39, 256]              --
│    │    └─Sequential: 3-5                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-14                                       [2, 39, 256]              --
│    │    └─Sequential: 3-6                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-15                                       [2, 39, 256]              --
│    │    └─Sequential: 3-7                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-16                                       [2, 39, 256]              --
│    │    └─Sequential: 3-8                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-17                                       [2, 39, 256]              --
│    │    └─Sequential: 3-9                                       [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-18                                       [2, 39, 256]              --
│    │    └─Sequential: 3-10                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-19                                       [2, 39, 256]              --
│    │    └─Sequential: 3-11                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-20                                       [2, 39, 256]              --
│    │    └─Sequential: 3-12                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-21                                       [2, 39, 256]              --
│    │    └─Sequential: 3-13                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-22                                       [2, 39, 256]              --
│    │    └─Sequential: 3-14                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-23                                       [2, 39, 256]              --
│    │    └─Sequential: 3-15                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-24                                       [2, 39, 256]              --
│    │    └─Sequential: 3-16                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-25                                       [2, 39, 256]              --
│    │    └─Sequential: 3-17                                      [2, 39, 256]              1,588,736
│    └─ConformerBlock: 2-26                                       [2, 39, 256]              --
│    │    └─Sequential: 3-18                                      [2, 39, 256]              1,588,736
├─Linear: 1-7                                                     [2, 39, 1024]             263,168
├─BatchNorm1d: 1-8                                                [2, 1024, 39]             2,048
├─ReLU: 1-9                                                       [2, 39, 1024]             --
├─Linear: 1-10                                                    [2, 39, 256]              262,400
├─TransformerDecoder: 1-11                                        [30, 2, 256]              --
│    └─ModuleList: 2-1                                            --                        --
│    │    └─TransformerDecoderLayer: 3-19                         [30, 2, 256]              1,578,752
│    │    └─TransformerDecoderLayer: 3-20                         [30, 2, 256]              1,578,752
│    │    └─TransformerDecoderLayer: 3-21                         [30, 2, 256]              1,578,752
│    │    └─TransformerDecoderLayer: 3-22                         [30, 2, 256]              1,578,752
│    │    └─TransformerDecoderLayer: 3-23                         [30, 2, 256]              1,578,752
│    │    └─TransformerDecoderLayer: 3-24                         [30, 2, 256]              1,578,752
├─Linear: 1-12                                                    [30, 2, 40]               10,280
├─Linear: 1-13                                                    [2, 39, 40]               10,280
===================================================================================================================
Total params: 35,711,056
Trainable params: 35,711,056
Non-trainable params: 0
Total mult-adds (G): 1.40
===================================================================================================================
Input size (MB): 0.10
Forward/backward pass size (MB): 94.16
Params size (MB): 142.84
Estimated Total Size (MB): 237.11
===================================================================================================================

The model code

# numClasses = 41 (<PAD> + 38 characters + <EOS> + <BOS>)
class AVSRwithConf2(nn.Module):
    def __init__(self, numClasses):
        super(AVSRwithConf2, self).__init__()
        self.e = 17
        self.d_k = 256
        self.logmel_dim = 80
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pos_encode = PositionalEncoding(dModel=self.d_k, maxLen=1000)

        #Conv2Dsubsampling module
        self.sequential = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=self.d_k, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Conv2d(self.d_k, self.d_k, kernel_size=3, stride=2),
            nn.ReLU(),
        )
        self.input_projection = nn.Sequential(
            # Linear is presumably the wrapper class from the sooftware/conformer repo linked below
            Linear(self.d_k * (((self.logmel_dim - 1) // 2 - 1) // 2), self.d_k),
            self.pos_encode,
            nn.Dropout(p=0.1),
        )

        self.linear1 = nn.Linear(512, self.d_k)

        #Conformer Encoder
        self.layers_A = nn.ModuleList([ConformerBlock(encoder_dim=self.d_k).to(self.device) for _ in range(self.e)])

        self.linear3 = nn.Linear(256, 4 * self.d_k)
        self.linear4 = nn.Linear(1024, self.d_k)
        self.bn1 = nn.BatchNorm1d(num_features=4 * self.d_k)
        self.embeddings = nn.Embedding(numClasses, self.d_k)
        self.relu = nn.ReLU()

        #Transformer Decoder
        self.decoder_layer = nn.TransformerDecoderLayer(d_model=self.d_k, nhead=8)
        self.decoder = nn.TransformerDecoder(self.decoder_layer, num_layers=6)

        # <BOS> never appears in the output, so the output layers use numClasses - 1
        self.linear5 = nn.Linear(256, numClasses-1)
        self.linear6 = nn.Linear(256, numClasses-1)
        

    def forward(self, x, tgt, tgt_mask, tgt_padding_mask):
        # x: (Batch, T, 80) log-mel features
        # tgt: (Batch, S) character indices
        # tgt_mask: (S, S) causal mask
        # tgt_padding_mask: (Batch, S)
    
        # Conv2Dsubsampling  https://github.com/sooftware/conformer/blob/main/conformer/convolution.py
        x = self.sequential(x.unsqueeze(1))
        batch_size, channels, subsampled_lengths, subsampled_dim = x.size()
        x = x.permute(0, 2, 1, 3)
        x = x.contiguous().view(batch_size, subsampled_lengths, channels * subsampled_dim)
        x = self.input_projection(x)


        #tgt embedding module
        tgt = self.embeddings(tgt) #tgt=(B, T), embedding=(B,T,D)
        tgt = tgt.transpose(0,1) #(T,B,D)
        tgt = self.pos_encode(tgt)

        #conformer block    
        for layer in self.layers_A:
            x = layer(x)

        # MLP
        x = self.linear3(x) # N,T,C
        x = x.transpose(1,2)
        x = self.bn1(x) # N,C,T
        x = x.transpose(1,2)
        x = self.relu(x)
        x = self.linear4(x)

        # to CE
        to_CE = x.transpose(0,1)
        to_CE = self.decoder(tgt=tgt, memory=to_CE, tgt_mask=tgt_mask, tgt_key_padding_mask=tgt_padding_mask)
        to_CE = self.linear5(to_CE)

        # to CTC
        x = self.linear6(x)
        x = F.log_softmax(x, dim=2) 

        to_CE = to_CE.transpose(0,1)
        return to_CE, x #(B,T=len(tgt), C),(B, T=len(frame), C)
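Decoder-side recognition is done autoregressively; a minimal sketch of greedy decoding with this model, assuming <BOS>=40 and <EOS>=39 as in the character mapping below (greedy_decode and max_len are illustrative names, and re-running the encoder at every step is wasteful but keeps the sketch short):

@torch.no_grad()
def greedy_decode(model, x, max_len=100, bos=40, eos=39):
    # x: (B, T, 80) log-mel batch
    model.eval()
    device = x.device
    ys = torch.full((x.size(0), 1), bos, dtype=torch.long, device=device)
    for _ in range(max_len):
        S = ys.size(1)
        tgt_mask = torch.triu(torch.full((S, S), float("-inf"), device=device), diagonal=1)
        pad_mask = torch.zeros(x.size(0), S, dtype=torch.bool, device=device)
        dec_out, _ = model(x, ys, tgt_mask, pad_mask)   # dec_out: (B, S, numClasses-1)
        next_tok = dec_out[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if (next_tok == eos).all():                     # stop once every sequence emits <EOS>
            break
    return ys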

Loss

I used the hybrid CTC/attention loss (lambda = 0.1); schematically:

Loss_function = lambda * nn.CTCLoss(zero_infinity=True) + (1 - lambda) * nn.CrossEntropyLoss(ignore_index=0)
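Per batch, the computation is roughly the following (a minimal sketch; tgt_in, tgt_out, targets, frame_lengths, and target_lengths are illustrative names, not my exact code). Note that nn.CTCLoss expects log-probs shaped (T, B, C) and that input_lengths must account for the Conv2d subsampling:

lam = 0.1
ctc_criterion = nn.CTCLoss(zero_infinity=True)     # blank index defaults to 0 (= <PAD> here)
ce_criterion = nn.CrossEntropyLoss(ignore_index=0)

def subsampled_length(t):
    # two stride-2, kernel-3 convs shrink T the same way the model shrinks the 80 mel bins
    return ((t - 1) // 2 - 1) // 2

# tgt_in  = <BOS> + characters (decoder input)
# tgt_out = characters + <EOS> (CE target)
# targets = padded character sequences without <BOS>/<EOS> (CTC target)
dec_out, ctc_out = model(x, tgt_in, tgt_mask, tgt_padding_mask)
input_lengths = torch.tensor([subsampled_length(t) for t in frame_lengths])
ctc_loss = ctc_criterion(ctc_out.transpose(0, 1),  # (B, T, C) -> (T, B, C)
                         targets, input_lengths, target_lengths)
ce_loss = ce_criterion(dec_out.reshape(-1, dec_out.size(-1)),  # (B*S, C)
                       tgt_out.reshape(-1))                    # (B*S,)
loss = lam * ctc_loss + (1 - lam) * ce_loss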

Dataset

Trained using the "pretrain" and "train" splits of the LRS2 dataset.

Speech was transformed into log-mel spectrograms using librosa, then normalized by dividing by the maximum absolute value.

import numpy as np
import librosa
from scipy.io import wavfile

audioParams = {"Window": "hann", "WinLen": 512, "Shift": 160, "Dim": 80}

# load file
sampFreq, inputAudio = wavfile.read(audioFile)
inputAudio = inputAudio.astype(np.float64)
# mel spectrogram
mel = librosa.feature.melspectrogram(y=inputAudio,
                                    sr=sampFreq,
                                    n_mels=audioParams["Dim"],
                                    window = audioParams["Window"],
                                    n_fft=audioParams["WinLen"],
                                    hop_length=audioParams["Shift"]) 

# log
log_mel = librosa.power_to_db(mel) # (n_mels, T)
# normalize
log_mel = log_mel / np.max(np.abs(log_mel))

The target was likewise converted to numeric indices, character by character. The log-mel spectrograms and targets are each zero-padded to match the longest item in the batch (a sketch follows the mapping below).

#character to index mapping
CHAR_TO_INDEX = {" ":1, "'":22, "1":30, "0":29, "3":37, "2":32, "5":34, "4":38, "7":36, "6":35, "9":31, "8":33,
                "A":5, "C":17, "B":20, "E":2, "D":12, "G":16, "F":19, "I":6, "H":9, "K":24, "J":25, "M":18,
                "L":11, "O":4, "N":7, "Q":27, "P":21, "S":8, "R":10, "U":13, "T":3, "W":15, "V":23, "Y":14,
                "X":26, "Z":28, "<EOS>":39, "<BOS>":40}    

# zero padding
PAD_IDX = 0
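
As mentioned above, encoding and padding look roughly like this (a minimal sketch; encode_target and texts are illustrative names, and the transcripts are assumed to be uppercase as in the mapping):

import torch
from torch.nn.utils.rnn import pad_sequence

def encode_target(text):
    # character -> index, wrapped with <BOS>/<EOS> for the decoder
    ids = [CHAR_TO_INDEX["<BOS>"]] + [CHAR_TO_INDEX[c] for c in text] + [CHAR_TO_INDEX["<EOS>"]]
    return torch.tensor(ids, dtype=torch.long)

# zero-pad every target in the batch to the longest one
batch_tgt = pad_sequence([encode_target(t) for t in texts],
                         batch_first=True, padding_value=PAD_IDX)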

Recognition Result

I ran recognition on the training data. Since it is training data, the model should be able to recognize it correctly.

However, the CTC output was something like

"I E E E E O E E".

The decoder output was almost the same for all the data, for example:

<BOS>I WAS TE THEN THENOROR ICASE T TH A A A AN AN A AN AN AN ALIS ATHE AN A ARE AN A AN AS AS AN AS AN AS AN AN AS AN ASTHELIANDE ANDE ANDE ANDE ANDE ANDE ANDE A ANDE A ANDE ANDE A ANDE ANDE A ANDE A ANDES A ANDES A ANDES AN A ANDES A AN A ANDES A AN A ANDES A ANDES A ANDES AN A A ANDES A ANDES AN A A ANDES A ANDES A ANES A A ANDES A ANDES A A ANES A A ANDEVEVES ARES ARES ARES ARES AR<EOS>

Both use greedy decoding.
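
On the CTC side, greedy decoding collapses repeats and removes blanks; a minimal sketch, assuming blank index 0 (the nn.CTCLoss default, which here coincides with <PAD>):

INDEX_TO_CHAR = {v: k for k, v in CHAR_TO_INDEX.items()}

def ctc_greedy_decode(log_probs, blank=0):
    # log_probs: (T, C) for a single utterance
    best = log_probs.argmax(dim=-1).tolist()
    chars, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:   # drop blanks, collapse repeats
            chars.append(INDEX_TO_CHAR.get(idx, ""))
        prev = idx
    return "".join(chars)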
