I have written a model; the architecture is as follows:
CNNLSTM(
  (cnn): CNNText(
    (embed): Embedding(19410, 300, padding_idx=0)
    (convs1): ModuleList(
      (0): Conv2d(1, 32, kernel_size=(3, 300), stride=(1, 1))
      (1): Conv2d(1, 32, kernel_size=(5, 300), stride=(1, 1))
      (2): Conv2d(1, 32, kernel_size=(7, 300), stride=(1, 1))
    )
    (dropout): Dropout(p=0.6)
    (fc1): Linear(in_features=96, out_features=1, bias=True)
  )
  (lstm): RNN(
    (embedding): Embedding(19410, 300, padding_idx=0)
    (rnn): LSTM(300, 150, batch_first=True, bidirectional=True)
    (attention): Attention(
      (dense): Linear(in_features=300, out_features=1, bias=True)
      (tanh): Tanh()
      (softmax): Softmax()
    )
    (fc1): Linear(in_features=300, out_features=50, bias=True)
    (dropout): Dropout(p=0.5)
    (fc2): Linear(in_features=50, out_features=1, bias=True)
  )
  (fc1): Linear(in_features=146, out_features=1, bias=True)
)
I have trained the RNN and the CNN separately on the same dataset and saved their weights. In the mixed model, I load those weights using the following function:
def load_pretrained_weights(self, model='cnn', path=None):
    if model not in ['cnn', 'rnn']:
        raise AttributeError("Model must be either rnn or cnn")
    if model == 'cnn':
        self.cnn.load_state_dict(torch.load(path))
    if model == 'rnn':
        self.lstm.load_state_dict(torch.load(path))
And I freeze the submodules using this function:
def freeze(self):
    for p in self.cnn.parameters():
        p.requires_grad = False
    for p in self.lstm.parameters():
        p.requires_grad = False
Then I trained the mixed model and got better results than either submodule trained and evaluated alone. I used an early-stopping technique in my epoch loop to save the best parameters. After training, I created a new instance of the same class, but when I load the saved "best" parameters into it I do not get similar results. I tried the same procedure with each submodule (RNN and CNNText here) alone and it worked, but for the mixed model it does not reproduce the same performance.
Please help me understand what is happening here. I am new to deep learning. Thank you.
A few experiments I tried:
- I loaded the saved weights of each submodule and then loaded the best parameters; that got somewhat close to the best result.
- I took the hidden layer from each submodule before applying the dropout; that was better than the previous attempt, but still not the best!
EDIT
The init function of my class is as follows; the RNN and CNNText are just standard implementations.
class CNNLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, embedding_weight, rnn_arch, isCuda=True, class_num=1,
                 kernel_num=32, kernel_sizes=[3, 4, 5], train_wv=False, rnn_num_layers=1,
                 rnn_bidirectional=True, rnn_use_attention=True):
        super(CNNLSTM, self).__init__()
        self.cnn = CNNText(vocab_size, embedding_dim, embedding_weight, class_num,
                           kernel_num=kernel_num, kernel_sizes=kernel_sizes, static=train_wv, dropout=0.6)
        self.lstm = RNN(rnn_arch, vocab_size, embedding_dim, embedding_weight, num_layers=rnn_num_layers,
                        rnn_unit='lstm', embedding_train=train_wv, isCuda=isCuda,
                        bidirectional=rnn_bidirectional, use_padding=True,
                        use_attention=rnn_use_attention, num_class=class_num)
        self.fc1 = nn.Linear(rnn_arch[-1] + len(kernel_sizes) * kernel_num, class_num)
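The forward pass essentially concatenates the pre-classifier features of the two submodules (96-dim from CNNText, 50-dim from the RNN branch) and feeds them to fc1. A simplified sketch; the extract_features method names below are hypothetical stand-ins for how my real submodules expose those hidden representations:

def forward(self, x):
    # Hypothetical feature-extraction calls: each submodule returns its hidden
    # representation before its own classifier head.
    cnn_feat = self.cnn.extract_features(x)   # (batch, 96)  -- method name is illustrative
    rnn_feat = self.lstm.extract_features(x)  # (batch, 50)  -- method name is illustrative
    combined = torch.cat([cnn_feat, rnn_feat], dim=1)  # (batch, 146)
    return self.fc1(combined)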
After instantiating the object, I loaded the individual pre-trained submodules as follows:
model.load_pretrained_weights('rnn', 'models/bilstm_2_atten.pth')
model.load_pretrained_weights('cnn', 'models/cnn2.pth')
model.freeze()
Then I trained only the last linear layer. I saved the model parameters with
torch.save(model.state_dict(), path)
The 'best' result comes around the third or fourth epoch from the end. After training, I loaded the parameters for the best result with:
state_dict = torch.load(MODEL_PATH)
model.load_state_dict(state_dict)
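For reference, the early-stopping pattern I mean in the epoch loop looks roughly like this (simplified sketch; the helper functions, patience value, and MODEL_PATH handling are placeholders, not my exact code):

best_score = float('-inf')
bad_epochs = 0
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)  # placeholder helper
    score = evaluate(model, val_loader)              # placeholder helper
    if score > best_score:
        best_score = score
        bad_epochs = 0
        torch.save(model.state_dict(), MODEL_PATH)   # keep the best parameters seen so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # patience is a placeholder
            break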