My ALBERT checkpoint files do not change in size during training

Question

I am training an ALBERT model for a question answering task. I have 200 thousand question-answer pairs, and I use a saved checkpoint file of 2 GB. I train on my GeForce RTX 2070 GPU and save a checkpoint every 1000 steps. During training, the model.ckpt-96000.data-00000-of-00001 checkpoint files stay at 135 MB and do not grow. Is this a problem?

I can't see why a much smaller dataset, such as 1500 question-answer pairs, also produces a 135 MB checkpoint file. Training hasn't finished yet, but is it possible that the model will improve with this training?

score 1 · Accepted Answer · answered Oct 09 '20 at 10:52

While training your model, you can store the weights in a collection of checkpoint files that contain only the trained weights, in a binary format.

In particular, the checkpoints contain:

one or more blocks that contain the weights of our model
an index file indicating which weights are stored in a particular block

The checkpoint file is always the same size because the model is always the same: the number of model parameters does not change, so the size of the saved weights does not change either. The suffix data-00000-of-00001 means the weights are stored in a single shard (shard 0 of 1), which is what you get when training on a single machine.
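
To make the suffix concrete, here is a minimal sketch that splits a shard file name into its parts. The helper name parse_shard_name and the regular expression are my own illustration, not part of any library; they only assume the <prefix>.data-<shard>-of-<total> naming seen in the question.

```python
import re

# Assumed naming scheme for checkpoint data shards: <prefix>.data-<shard>-of-<total>
SHARD_PATTERN = re.compile(r"^(?P<prefix>.+)\.data-(?P<shard>\d+)-of-(?P<total>\d+)$")


def parse_shard_name(filename):
    """Return (prefix, shard_index, shard_count), or None if the name is not a data shard.

    Hypothetical helper written for this answer.
    """
    match = SHARD_PATTERN.match(filename)
    if match is None:
        return None
    return match["prefix"], int(match["shard"]), int(match["total"])


# File name taken from the question.
prefix, shard_index, shard_count = parse_shard_name("model.ckpt-96000.data-00000-of-00001")
print(prefix, shard_index, shard_count)
```

A shard count of 1 means all the weights sit in that one data file, next to the index file.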

The size of the dataset, in my opinion, has nothing to do with it.
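
As a rough sketch of why this is so, the size of the stored weights can be estimated from the variable shapes alone. The variable names and shapes below are made up for illustration (they are not ALBERT's real layout), and the estimate assumes 32-bit floats and ignores file-format overhead.

```python
from math import prod

# Hypothetical variable shapes, for illustration only (not ALBERT's real layout).
VARIABLE_SHAPES = {
    "embeddings/word_embeddings": (30000, 128),
    "encoder/attention/query/kernel": (768, 768),
    "encoder/attention/query/bias": (768,),
}

BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats


def estimate_weights_bytes(shapes, bytes_per_value=BYTES_PER_FLOAT32):
    """Approximate size of the saved weights: number of parameters times bytes per value."""
    total_params = sum(prod(shape) for shape in shapes.values())
    return total_params * bytes_per_value


# The function has no dataset argument: the estimate depends only on the model.
estimated_bytes = estimate_weights_bytes(VARIABLE_SHAPES)
print(f"{estimated_bytes / 1e6:.1f} MB")
```

Whether you train on 1500 or 200 thousand pairs, the same shapes go in, so the same size comes out. Only the values stored inside the file change as training progresses.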

Elidor00


Yes, I trained on a single machine. Despite my large dataset, I could only train with a batch size of 8. I have gone through 400k steps and the loss is still 3.5. Do you think there is any way to reduce it? :( – Việt Nguyễn Oct 10 '20 at 02:58
Without knowing what problem you are facing, the model you use, etc., it is very difficult to give useful advice... If you want to increase the batch size and can't do it on your GPU, you could try "free" services like Google Colab and/or Kaggle. – Elidor00 Oct 10 '20 at 14:23
