I am working on a project where I built an LSTM model for seq2seq classification with time-aligned input and output. Each audio time series has 32000 samples, and its labels also have length 32000. I want to make a classification decision (fake or real audio) for every sample of the audio. For one audio example, my tensors look like this:
Time series:
tensor([-0.1635, -0.1510, -0.1455, ..., -0.3338, -0.3163, -0.2944])
Labels (0s and 1s, one per sample):
tensor([0, 0, 0, ..., 0, 0, 0])
Across all the audio files, about 45-50% of the samples are labeled 1 (fake) and the rest are labeled 0 (real), so class imbalance isn't an issue here. Also, the fake audio always falls within one region of the audio. For example, in an audio time series of length 32000, the fake region could run from 22000 to 30000, or from 5000 to 24000.
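To make the label layout concrete, here is a minimal sketch of one example's labels. The helper name `make_region_labels` is hypothetical and the boundaries are just the first example above; the real labels come from the dataset.

```python
import torch

def make_region_labels(length: int, fake_start: int, fake_end: int) -> torch.Tensor:
    """Per-sample labels: 1 inside [fake_start, fake_end), 0 everywhere else."""
    labels = torch.zeros(length, dtype=torch.long)
    labels[fake_start:fake_end] = 1
    return labels

# Hypothetical example: fake region from sample 22000 up to (not including) 30000.
labels = make_region_labels(32000, 22000, 30000)
print(labels.shape)               # one label per audio sample
print(int(labels.sum().item()))   # number of samples marked fake
```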
I initially tried this problem with an LSTM that takes a fixed input size of 32000. The model only ever predicted one class, and the loss never decreased, even after about 5 epochs. I then read that LSTMs trained on long input sequences like my audio are prone to vanishing gradients and may end up unable to learn. So I changed my training method: for each batch from the PyTorch DataLoader, the 32000-length time series is split into smaller chunks (I tried 100, 200, 500, 1000, etc.) and backpropagation is done on each chunk. With this method the model had some recall during training, so it was predicting both classes, but the loss still did not decrease at all, even after about 5 epochs.
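A minimal sketch of the chunked scheme described above, not my exact pipeline. Everything here is an assumption made for illustration: the random stand-in data, the scaled-down sequence length (3200 instead of 32000, to keep it quick), the hidden size, the Adam optimizer, and `nn.CrossEntropyLoss` fed with raw logits.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only; the real series has 32000 samples.
batch, seq_len, chunk_len, hidden = 2, 3200, 400, 8
series = torch.randn(batch, seq_len)            # stand-in for real audio
labels = torch.randint(0, 2, (batch, seq_len))  # stand-in per-sample labels

lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 2)
criterion = nn.CrossEntropyLoss()               # assumption: takes raw logits
params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

state = None      # (h, c) carried from one chunk to the next
n_chunks = 0
losses = []
chunks = zip(series.split(chunk_len, dim=1), labels.split(chunk_len, dim=1))
for x, y in chunks:
    optimizer.zero_grad()
    out, state = lstm(x.unsqueeze(2), state)    # out: (batch, chunk_len, hidden)
    logits = head(out)                          # (batch, chunk_len, 2)
    loss = criterion(logits.reshape(-1, 2), y.reshape(-1))
    loss.backward()                             # gradients stop at the chunk start
    optimizer.step()
    # Keep the state values but cut the graph, so the next chunk
    # does not backpropagate into this one.
    state = tuple(s.detach() for s in state)
    losses.append(loss.item())
    n_chunks += 1
```

The detach at the end of each iteration is what makes this truncated backpropagation: the hidden and cell values still flow forward across chunk boundaries, but gradients do not flow backward across them.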
class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_dim=2, batch_size=batch_size):
        super(LSTM, self).__init__()
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.rnn_state = None
        self.lstm = nn.LSTM(self.input_size, self.hidden_size, self.num_layers, batch_first=True)
        self.to_linear = nn.Linear(self.hidden_size, output_dim)

    def repackage_rnn_state(self):
        self.rnn_state = self._detach_rnn_state(self.rnn_state)

    def _detach_rnn_state(self, h):
        if isinstance(h, torch.Tensor):
            return h.detach()
        else:
            return tuple(self._detach_rnn_state(v) for v in h)

    def forward(self, x):
        lstm_out, self.rnn_state = self.lstm(x, self.rnn_state)
        logits = self.to_linear(lstm_out)
        scores = F.softmax(logits, dim=2)
        return scores
for epoch in range(num_epochs):
    running_loss_train, accuracy_train = 0, 0
    network.hidden = network.init_hidden()
    for batch_idx, batch in enumerate(train_loader):
        new_time_series, new_labels = batch
        new_time_series, new_labels = new_time_series.to(device), new_labels.to(device)
        optimizer.zero_grad()
        # forward pass feeding the padded sequence
        preds = network(new_time_series.unsqueeze(2))
        preds = preds.view(preds.shape[0] * preds.shape[1], preds.shape[2])
        train_loss = criterion(preds, new_labels.flatten().long())
        train_loss.backward()
        # gradient clipping here just before the parameter updates, clips inplace, clip value = 1
        torch.nn.utils.clip_grad_norm_(network.parameters(), clip)
        optimizer.step()
        running_loss_train += train_loss.detach().item() / new_time_series.shape[0]
        _, predictions = torch.max(preds, 1)
        metrics = get_evaluations_ignoring_padding(new_labels, predictions, 2)
        accuracy_train += metrics['Accuracy']
        if batch_idx % 100 == 0:
            print('Epoch : %d | BatchID : %d | Train Loss : %.3f | Train Acc : %.3f ' %
                  (epoch, batch_idx, train_loss.item() / new_time_series.shape[0],
                   metrics['Accuracy']))
The code above is the version that uses "truncated" backpropagation, splitting each time series into multiple chunks. When I trained on the full 32000-length sequences, I didn't have repackage_rnn_state or _detach_rnn_state. Neither the original approach nor the chunked approach decreased the loss.
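Since the training loop flattens predictions and labels before computing the loss, here is a small sketch of that reshaping on toy tensors, to show that row i of the flattened predictions still lines up with label i. The sizes and random values are placeholders, not my data.

```python
import torch

torch.manual_seed(0)

batch, seq_len, num_classes = 3, 5, 2                      # toy sizes
preds = torch.rand(batch, seq_len, num_classes)            # stand-in model output
labels = torch.randint(0, num_classes, (batch, seq_len))   # stand-in labels

# Same reshaping as in the training loop: one row per (example, time step).
flat_preds = preds.view(preds.shape[0] * preds.shape[1], preds.shape[2])
flat_labels = labels.flatten().long()

# Row i corresponds to example i // seq_len at time step i % seq_len.
i = seq_len + 2
same_pred = torch.equal(flat_preds[i], preds[1, 2])
same_label = flat_labels[i].item() == labels[1, 2].item()

# Per-sample predicted class and accuracy over all samples.
_, predictions = torch.max(flat_preds, 1)
accuracy = (predictions == flat_labels).float().mean().item()
print(flat_preds.shape, flat_labels.shape, same_pred, same_label)
```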
I previously did many-to-one classification of real vs. fake audio, where half of the audio files had no fake regions and the other half had some. A CNN model reached 99% accuracy on that task after only 3 epochs. So I think my data is fairly easy to train on, and moving to a many-to-many problem shouldn't by itself cause learning issues.
Is there something wrong with my code?
Is an LSTM not appropriate for this kind of problem?
I'd like some advice on how you would go about training this kind of task.
Please let me know if you need more information and I will update my question. Thanks.