3

I was training my NER model with transformers, and I'm not really sure why training stopped at the point it did, or why it ran for so many batches in the first place. This is the relevant part of my configuration file:

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 2
max_steps = 0
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.00005

And this is the training log:

============================= Training pipeline =============================
ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 5e-05
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0         398.75     40.97    2.84    3.36    2.46    0.03
  0     200         906.30   1861.38   94.51   94.00   95.03    0.95
  0     400         230.06   1028.51   98.10   97.32   98.89    0.98
  0     600          90.22   1013.38   98.99   98.40   99.58    0.99
  0     800          80.64   1131.73   99.02   98.25   99.81    0.99
  0    1000          98.50   1260.47   99.50   99.16   99.85    1.00
  0    1200          73.32   1414.91   99.49   99.25   99.73    0.99
  0    1400          84.94   1529.75   99.70   99.56   99.85    1.00
  0    1600          55.61   1697.55   99.75   99.63   99.87    1.00
  0    1800          80.41   1936.64   99.75   99.63   99.87    1.00
  0    2000         115.39   2125.54   99.78   99.69   99.87    1.00
  0    2200          63.06   2395.48   99.80   99.75   99.85    1.00
  0    2400         104.14   2574.36   99.87   99.79   99.96    1.00
  0    2600          86.07   2308.35   99.88   99.79   99.97    1.00
  0    2800          81.05   1853.15   99.90   99.87   99.93    1.00
  0    3000          52.67   1462.61   99.96   99.93   99.99    1.00
  0    3200          57.99   1154.62   99.94   99.91   99.97    1.00
  0    3400         110.74    847.50   99.90   99.85   99.96    1.00
  0    3600          90.49    621.99   99.90   99.91   99.90    1.00
  0    3800          51.03    378.93   99.87   99.78   99.97    1.00
  0    4000          93.40    274.80   99.95   99.93   99.97    1.00
  0    4200         138.98    203.28   99.91   99.87   99.96    1.00
  0    4400         106.16    127.60   99.70   99.75   99.64    1.00
  0    4600          70.28     87.25   99.95   99.94   99.96    1.00
✔ Saved pipeline to output directory
training/model-last

I was trying to train my model for 2 epochs (max_epochs = 2). My train file has around 123591 examples, and my dev file has 2522 examples.

My questions are:

  • Since my minimum batch size is 100, I expected my training to end before the 2400th eval batch. Reaching batch 2400 implies at least 2400 * 100 = 240000 examples, and actually even more than that, since my batch size is increasing. So why did it go all the way to # 4600?

  • The training ended automatically, but the E column still reads epoch 0. Why is that?

Edit: Following up on my 2nd bullet point, I'm curious why the training went all the way up to 4600 batches. 4600 batches at a minimum of 100 examples each means 4600 * 100 = 460000 examples, and I only gave 123591 examples for training, so I'm clearly well past the 1st epoch, yet E still reads 0.
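
For reference, this is how I understand the compounding.v1 schedule controlling my batch size (a rough plain-Python sketch of the documented behaviour, ignoring the t offset; not spaCy's actual code):

def compounding(start, stop, compound):
    # The size value starts at `start` and is multiplied by `compound`
    # after every batch, capped at `stop`.
    value = start
    while True:
        yield min(value, stop)
        value *= compound

sizes = compounding(start=100, stop=1000, compound=1.001)
batch_sizes = [next(sizes) for _ in range(3000)]
print(round(batch_sizes[0]), round(batch_sizes[-1]))  # 100 1000 (grows slowly towards the cap)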

archity

3 Answers

4

There's an entry for this in the FAQ, but to summarize:

  • max_steps is the maximum number of training steps (individual batches, not evaluation iterations).
  • max_epochs is the maximum number of epochs.
  • If training goes for patience steps without improvement, it stops. That is what stopped your training.

It seems like your model has already reached a near-perfect score, so I'm not sure why early stopping is a problem in this case, but that's what's happening.
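
Roughly, the patience check works like this (a simplified sketch of the idea, not spaCy's actual training loop):

def stopping_step(scores_by_step, patience=1600):
    # scores_by_step: (step, dev score) pairs, one per evaluation
    # (every eval_frequency = 200 steps in your config). Training stops
    # once the best score has not improved for `patience` steps.
    best_score, best_step = float("-inf"), 0
    for step, score in scores_by_step:
        if score > best_score:
            best_score, best_step = score, step
        elif step - best_step >= patience:
            return step
    return None

# In your log the last improvement is around step 3000,
# so training stops near 3000 + 1600 = 4600.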

polm23
  • That's all fine, but my question still doesn't seem to be answered, specifically those 2 bullet points including the last edit. I specifically wanted to know why training went all the way up to 4600 batches while still being at the 0th epoch. – archity Jul 22 '21 at 12:42
  • At 3000 batches your model got a score of 99.99. 1600 batches later it hadn't gotten a better score so it stopped because of patience. Epoch number was irrelevant for that. – polm23 Jul 22 '21 at 15:56
  • As to why it was still the zero epoch, you are using `batch_by_words`, so the 100 is not the number of documents, it is the number of **words**. So if you have one document with 100 words, that could be a batch by itself. – polm23 Jul 22 '21 at 15:58
  • So if I want the batcher to batch by documents rather than words, should I instead use "spacy.batch_by_sequence.v1", or something else? – archity Jul 22 '21 at 17:18
  • Yes, you can use any batcher that isn't based on words (which I think is all of the rest of them at the moment). Note that the current batcher still batches whole documents - it will not slice a document in half - but the batch size is determined based on words. – polm23 Jul 23 '21 at 06:30
  • Okay, so in my current configuration the number of documents processed in each batch will depend on the number of words allowed (100) and also on the number of words in each of the documents being processed (a batch can possibly contain more than 1 document if the documents have very few words). Am I correct in this interpretation? – archity Jul 23 '21 at 12:08
  • Yes, the batcher will add documents to the batch until it goes over 100 (or whatever) words, then it will stop. If a single document is too long it will be added to the batch anyway, depending on your settings; see the sketch below. This is all in the docs: https://spacy.io/api/top-level#batch_by_words – polm23 Jul 24 '21 at 02:48
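
A minimal sketch of that word-budget grouping, assuming a fixed budget for simplicity (illustrative plain Python, not spaCy's actual batch_by_words implementation):

def batch_by_words(docs, size=100, discard_oversize=False):
    # Keep adding documents to the current batch until the total word
    # count would exceed `size`, then start a new batch.
    batch, n_words = [], 0
    for doc in docs:
        doc_len = len(doc)  # number of words/tokens in the document
        if discard_oversize and doc_len > size:
            continue  # drop documents longer than the whole budget
        if batch and n_words + doc_len > size:
            yield batch
            batch, n_words = [], 0
        batch.append(doc)
        n_words += doc_len
    if batch:
        yield batch

# Three "documents" of 60, 50 and 30 words with a budget of 100 words:
docs = [["w"] * 60, ["w"] * 50, ["w"] * 30]
print([len(b) for b in batch_by_words(docs, size=100)])  # [1, 2]
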
3

I think your training is stopping because of patience = 1600, which tells the training to stop if there is no improvement within that many batches.

With my datasets, I have to bump patience up significantly; otherwise (as in your case) I do not even complete epoch 0. I just envy you your scores; I rarely get over 0.9.

The max_epochs = 2 setting tells it to stop after 2 epochs; it is not a minimum.

mbrunecky
  • Well yes, but why did the training go all the way up to 4600 batches with E still at 0? 4600 batches at minimum means 4600*100=460000 examples, but I only gave 123591 in train data, so it's clearly more than 1 epoch, right? Unless I'm misinterpreting it all? – archity Jul 21 '21 at 21:38
  • Because you were still in epoch 0. There was improvement in the first 3000 steps; only from step 3000 to 4600 did it not see any further improvement. The evaluation/reporting interval is set by eval_frequency = 200; see the worked numbers below. – mbrunecky Jul 22 '21 at 23:16
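
To make that arithmetic concrete, here is a back-of-the-envelope sketch; the average document length is an assumed value, since it is not stated in the question:

# Why training stopped at step 4600:
patience = 1600
last_improvement_step = 3000  # roughly where the best score appears in the log
print(last_improvement_step + patience)  # 4600

# Why step 4600 can still be inside epoch 0 when batching by words:
train_docs = 123591
avg_words_per_doc = 200   # assumption for illustration, not from the question
words_per_batch = 1000    # upper end of the compounding schedule
docs_per_batch = words_per_batch / avg_words_per_doc  # ~5 documents per batch
print(4600 * docs_per_batch, "<", train_docs)  # ~23000 documents seen, still epoch 0
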
1

Number of epochs. 0 means unlimited. If >= 0, train corpus is loaded once in memory and shuffled within the training loop. -1 means stream train corpus rather than loading in memory with no shuffling within the training loop.

This is given in the official spaCy documentation: https://spacy.io/usage/training

Rigel Tal