
I am training a model with the Faster R-CNN architecture. For the first session I used the config below:

from detectron2 import model_zoo
from detectron2.config import get_cfg

def get_train_cfg(config_file_path, checkpoint_url, train_dataset_name, test_dataset_name, num_classes, device, output_dir):

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(config_file_path))
    # Note: this must be cfg.MODEL.WEIGHTS; cfg.MODEL_WEIGHTS silently sets a
    # stray attribute and the pretrained weights are never loaded
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(checkpoint_url)
    cfg.DATASETS.TRAIN = (train_dataset_name,)
    cfg.DATASETS.TEST = (test_dataset_name,)  # was train_dataset_name, leaving the parameter unused

    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.SOLVER.IMS_PER_BATCH = 1
    cfg.SOLVER.BASE_LR = 0.0001
    cfg.SOLVER.MAX_ITER = 16000
    cfg.SOLVER.STEPS = []  # no decay steps, so the LR stays at BASE_LR for the whole run

    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes
    cfg.MODEL.DEVICE = device
    cfg.OUTPUT_DIR = output_dir

    return cfg

I want to continue my training. I have last_checkpoint, metrics.json, cfg.pickle, and model_final.pth.
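
Roughly, the resume code looks like this (a sketch using the standard Detectron2 DefaultTrainer; cfg comes from get_train_cfg above):

```python
from detectron2.engine import DefaultTrainer

# cfg is rebuilt with the same get_train_cfg(...) arguments as the first session
trainer = DefaultTrainer(cfg)

# resume=True makes the trainer read last_checkpoint from cfg.OUTPUT_DIR and
# restore the model weights, optimizer, scheduler, and iteration counter
trainer.resume_or_load(resume=True)
trainer.train()
```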

This is the link to my Notebook

Training should resume at iteration 16001, and the total loss at iteration 16000 was about 0.8. However, the learning rate never changed from 0.0001 between iterations 0 and 16000. When I continue training with resume_or_load(resume=True), the output below is shown:

[12/04 05:54:36 d2.data.datasets.coco]: Loaded 381 images in COCO format from ../input/cascade-rcnn/train.json
[12/04 05:54:36 d2.data.build]: Removed 1 images with no usable annotations. 380 images left.
[12/04 05:54:36 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[12/04 05:54:36 d2.data.build]: Using training sampler TrainingSampler
[12/04 05:54:36 d2.data.common]: Serializing 380 elements to byte tensors and concatenating them all ...
[12/04 05:54:36 d2.data.common]: Serialized dataset takes 2.23 MiB
[12/04 05:54:37 d2.engine.hooks]: Loading scheduler from state_dict ...
[12/04 05:54:37 d2.engine.train_loop]: Starting training from iteration 16000
[12/04 05:54:37 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
[12/04 05:54:37 d2.data.datasets.coco]: Loaded 381 images in COCO format from ../input/cascade-rcnn/train.json
[12/04 05:54:38 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[12/04 05:54:38 d2.data.common]: Serializing 381 elements to byte tensors and concatenating them all ...
[12/04 05:54:38 d2.data.common]: Serialized dataset takes 2.23 MiB
WARNING [12/04 05:54:38 d2.engine.defaults]: No evaluator found. Use DefaultTrainer.test(evaluators=), or implement its build_evaluator method.
[12/04 05:54:38 d2.utils.events]: iter: 16001 lr: N/A max_mem: 1627M

It shows

lr: N/A

Why is that?

I am using:

  • Python: 3.7.10
  • Detectron2: 0.6
  • Torch: 1.9.1+cu101
Meet Gondaliya

1 Answer


There is actually no error.

The problem is that your config sets the maximum number of iterations to 16000:

cfg.SOLVER.MAX_ITER = 16000

However, the console output shows that your previous session already completed 16000 iterations, so the new session has nothing left to do.

Raising the maximum number of iterations above 16000 (after loading the previously saved config) will continue the training as you expect.
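
A minimal sketch of the fix, assuming the first session's config was pickled to cfg.pickle (adjust the path and the new iteration ceiling to your setup):

```python
import pickle

from detectron2.engine import DefaultTrainer

# Load the config saved after the first session (path is an assumption)
with open("cfg.pickle", "rb") as f:
    cfg = pickle.load(f)

# Raise the ceiling past the 16000 iterations already completed,
# e.g. to train for another 16000 iterations
cfg.SOLVER.MAX_ITER = 32000

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=True)  # restores weights and the iteration counter
trainer.train()                      # continues from iteration 16000
```

The checkpointer restores the iteration counter from last_checkpoint, so training picks up exactly where it stopped and runs until the new MAX_ITER.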

zepman