
I am loading my PyTorch model (the checkpoint with the best result on the dev set during training) for evaluation. Even though I call model.eval() and wrap the forward pass in with torch.no_grad(), I still get lower accuracy on the dev set (a 1-2% drop) compared with what I got during training.

I have tried:

  • printing the state dict right before PyTorch saves the best model during training and comparing it with the one I get after loading; they are identical (see the sketch after this list).
  • checking my code, which uses lots of dropout and layer norm layers; I found no error.
  • loading the model on the same GPU, which did not help.
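
Roughly how I compared the saved and loaded state dicts (a simplified sketch of my actual check; ckpt_path is a placeholder for my checkpoint path):

import torch

saved = model.state_dict()                              # right before trainer.save
loaded = torch.load(ckpt_path, map_location='cpu')      # read back from disk

assert saved.keys() == loaded.keys()
for name in saved:
    # torch.equal is an exact element-wise comparison
    if not torch.equal(saved[name].cpu(), loaded[name].cpu()):
        print("mismatch in", name)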

My working environment:

  • Python 3.6.10, PyTorch 1.7.1 (with CUDA 11.1)
  • GPU: NVIDIA 2080 Ti
  • the same seed (NumPy and PyTorch) during training and evaluation (see the seeding sketch after this list)
  • model.eval() and with torch.no_grad() on the dev set during both training and evaluation
  • the same dev set and the same metric calculation method
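
For reference, the seeding is roughly the following (the value 42 is just an example):

import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)   # called once at the top of both the training and the evaluation script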

Here is my pseudocode during training (the original code is too heavy to post):

# load my data.
train_dataset = FinetuningDataset(vocab=vocab, domains=domains, data_files=data_files, max_len=data_config["max_len"], giga_embedding_vocab=giga_embedding.word2id)

val_dataset = FinetuningDataset(vocab, domains=domains, data_files=dev_data_path, max_len=data_config['max_len'], giga_embedding_vocab=giga_embedding.word2id)

sp_collator = SortPadCollator(sort_key=lambda x:x[0], ignore_indics=[0])   
train_iter = DataLoader(dataset=train_dataset,  
                        batch_size=data_config["batch_size"], 
                        shuffle=data_config["shuffle"],
                        collate_fn=sp_collator)
val_iter = DataLoader(dataset=val_dataset,  
                    batch_size=data_config["batch_size"], 
                    shuffle=data_config["shuffle"], 
                    collate_fn=sp_collator)
adatrans = AdaTrans(vocab=vocab, config=model_config, domain_size=len(domains))
adatrans.load_state_dict(torch.load('ckpt_adatrans/litebert_1e-3_50cls_cuda2.pt'))
model = MixLM(adatrans=adatrans, vocab=vocab, config=model_config, giga_embedding=giga_embedding)

# these are my loss functions during training.
loss_fn_dct = {"mask_loss": neg_log_likelihood_loss, "emb_mse_loss":nn.MSELoss(reduction='none'), "domain_cls_loss":nn.NLLLoss(reduction='none')}
metrics_fn_dct = {"mask_metrics":accuracy}

# build a trainer.
trainer = ftTrainer(loss_fn_dct=loss_fn_dct, metrics_fn_dct=metrics_fn_dct, config=trainer_config)
# train; returns the best result on the dev set, which is then saved to checkpoint.pt
best_res, best_state_dict = trainer.train(model=model, train_iter=train_iter, val_iter=val_iter, optimizer=trainer_config['optimizer'], device=trainer_config['device'])
print("best result:: ", best_res)
trainer.save(best_state_dict, trainer_config['model_path'])
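
trainer.save itself is essentially just torch.save (simplified sketch of my ftTrainer.save):

# simplified sketch of ftTrainer.save
def save(self, state_dict, path):
    torch.save(state_dict, path)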

and in trainer.py, I save the best state dict and return it:

model.eval()
for dev_batch in val_iter:
    with torch.no_grad():
        # self.val() runs the model's forward pass and returns the prediction result.
        dev_res = self.val(dev_batch, model, device)
        dev_loss += dev_res['loss'].item()
# this call computes the dev metric (the one that drops during evaluation).
dev_metric = model.domain_biaffine._attachment_scores.get_metric(reset=True)
if dev_metric['UAS'] > best_UAS:
    best_UAS = dev_metric['UAS']
    best_res, best_state_dict = dev_metric, model.state_dict()

print("dev_loss: ", dev_loss / cnt_iter)
print("dev metric: ", dev_metric)

In evaluation.py, I just load checkpoint.pt and make predictions:

test_dataset = FinetuningDataset(vocab=vocab, domains=domains, data_files=data_files, max_len=data_config["max_len"], giga_embedding_vocab=giga_embedding.word2id)

sp_collator = SortPadCollator(sort_key=lambda x:x[0], ignore_indics=[0])   

test_iter = DataLoader(dataset=test_dataset,  
                        batch_size=data_config["batch_size"], 
                        shuffle=False,
                        collate_fn=sp_collator)

adatrans = AdaTrans(vocab=vocab, config=model_config, domain_size=len(domains))
model = MixLM(adatrans=adatrans, vocab=vocab, config=model_config, giga_embedding=giga_embedding)

# load the PyTorch checkpoint.pt
model.load_state_dict(torch.load(data_config['model_path'], map_location=torch.device('cuda:1')), strict=True)

trainer = ftTrainer(config=trainer_config, vocab=vocab, id2word=giga_embedding.id2word)
# this line makes predictions: it runs model.forward and prints the metric (computed the same way as in the trainer.py snippet).
trainer.inference(model=model, test_iter=test_iter, device=trainer_config['device'])
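
trainer.inference mirrors the validation loop from trainer.py (simplified sketch):

# simplified sketch of ftTrainer.inference
def inference(self, model, test_iter, device):
    model.to(device)
    model.eval()
    with torch.no_grad():
        for batch in test_iter:
            self.val(batch, model, device)   # same forward/metric path as during training
    metric = model.domain_biaffine._attachment_scores.get_metric(reset=True)
    print("test metric: ", metric)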

I have been searching on Google for a long time but found nothing helpful. This is really bothering me. Could anyone help me with it? Thanks in advance!
