Ultralytics Yolov8 fails to train to detect objects

Question

I am trying to train Yolov8 to detect black dots on human skin. An example of skin and markup is shown below. I cropped the images to 256x256 pixels, kept only the crops that contain at least one label, and ended up with training, test and validation datasets (4000, 2000 and 2000 images respectively). A segmentation model from segmentation_models_pytorch predicts black dots with IoU=0.15. This is decent, given that the markup is rectangular (while after image rotation augmentation the best guess is a circle) and that the object doesn't have a clear boundary (unlike a car or a chair).

I've tried training Yolov8 with the command

yolo detect train data=D:\workspace\ultralytics\my_coco.yaml model=yolov8n.yaml epochs=100 imgsz=256 workers=2 close_mosaic=100 project='bd' flipud=0.5 mosaic=0.0

Each epoch reports

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
 19/100      1.44G        nan        nan        nan        372        256: 100%|██████████| 267/267 [00:44<00:00,  5.96it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 44/44 [00:06<00:00,  7.06it/s]
               all       1391      26278          0          0          0          0

Full output can be found here

https://pastebin.com/TLz32ZRv

A different command yields better results:

yolo detect train data=D:\workspace\ultralytics\my_coco.yaml model=yolov8n.pt epochs=100 imgsz=256 workers=2 close_mosaic=0 project='bd' flipud=0.5 mosaic=0.5

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  1/100      1.56G        nan        nan        nan        389        256: 100%|██████████| 267/267 [00:44<00:00,  5.96it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 44/44 [00:05<00:00,  8.08it/s]
               all       1391      26278     0.0118    0.00202    0.00747    0.00278
....

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  4/100      1.54G        nan        nan        nan        165        256: 100%|██████████| 267/267 [00:43<00:00,  6.12it/s]
             Class     Images  Instances      Box(P          R      mAP50  mAP50-95): 100%|██████████| 44/44 [00:05<00:00,  7.65it/s]
               all       1391      26278     0.0118    0.00202    0.00747    0.00278

I still get nan for the training losses at the end of each epoch, although some values do appear during the epoch. For example:

  6/100      1.05G      3.279      2.164      0.937        565        256: 47%|████▋ | 126/267 [00:20<00:22, 6.22it/s]

Also, note that the validation metrics stay stuck at the values from the first epoch.

Is the picture an example of the annotations? If so, this seems like an extremely hard task: very small bounding boxes and annotations that are not consistent enough. It will be very hard for you to get results at all. Some annotations are clear (the small, dark dots), others not at all, in the sense that there is no difference between them and the rest of the dots and lines in the image. — Mike B, May 25 '23 at 17:36
This is a two-part problem. I've solved the first part (the nan problem). The training yielded mAP50=0.27, which is expected given the complexity of the task. — sixtytrees, May 26 '23 at 10:29
I guess mAP50=0.27 is realistic given the training data. Will you refine the annotations? — Mike B, May 26 '23 at 11:21

score 0 · Accepted Answer · answered May 26 '23 at 10:46

The nan problem and the lack of learning were caused by an issue with the environment. I've learned that you need to create a new environment and run only pip install ultralytics. During the first training run it reports that it is running on the CPU. You might be tempted to install the CUDA build of pytorch from Nvidia. Do not do this. Instead, reboot your computer and rerun the training. It will then report that it is running on the GPU (assuming you have an Nvidia graphics card and the Nvidia toolkit is installed).
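
Before starting a long run, it may help to check which device pytorch actually sees in the fresh environment. The snippet below is only a minimal sketch, not part of Ultralytics: the function name describe_torch_device is my own, and it relies on torch.cuda.is_available(), torch.cuda.get_device_name() and torch.version.cuda from pytorch.

```python
import importlib.util


def describe_torch_device() -> str:
    """Return a one-line summary of the device PyTorch would use."""
    # Avoid an ImportError when PyTorch is missing from the environment.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    if torch.cuda.is_available():
        return f"GPU: {torch.cuda.get_device_name(0)} (torch {torch.__version__})"
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return f"CPU only (torch {torch.__version__}, CUDA build: {torch.version.cuda})"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this prints a "CPU only" line, training will most likely also run on the CPU.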

Next. You must use if __name__ == "__main__": as shown below.

from ultralytics import YOLO

if __name__ == "__main__":  # this is crucial
    model = YOLO('yolov8n.pt')
    model.train(data='my.yaml', epochs=1000, imgsz=256, workers=1)

Otherwise you start a fork bomb.

Next. Start with a pretrained model and try their coco8.yaml. The prediction accuracy will steadily drop during training. This is normal: coco8 is a very small dataset, so the CNN runs into catastrophic forgetting.

Then download Pascal VOC 2007 and parse it. Ultralytics expects YOLO-format .txt label files, while Pascal VOC uses xml (and they use different notation for box coordinates). Train your pretrained model to detect persons only. Once again, there will be an initial drop in performance, but after some 20 epochs you get steady improvement (which doesn't quite reach the metrics of the original model, but is close). Now you can tweak the parameters in default.yaml.
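
The parsing step could look roughly like the sketch below. It assumes the usual Pascal VOC layout (a size element with width and height, and object elements with name and bndbox holding xmin, ymin, xmax, ymax in pixels) and the YOLO label format that, as far as I know, Ultralytics reads: one line per box with the class id followed by the normalized x center, y center, width and height. The function name voc_xml_to_yolo_lines and the sample annotation are made up for illustration.

```python
import xml.etree.ElementTree as ET


def voc_xml_to_yolo_lines(xml_text: str, class_ids: dict) -> list:
    """Convert one Pascal VOC annotation into YOLO label lines.

    Objects whose class name is not in class_ids are skipped, so
    class_ids={"person": 0} keeps persons only.
    """
    root = ET.fromstring(xml_text)
    size = root.find("size")
    img_w = float(size.findtext("width"))
    img_h = float(size.findtext("height"))

    lines = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        if name not in class_ids:
            continue
        box = obj.find("bndbox")
        xmin = float(box.findtext("xmin"))
        ymin = float(box.findtext("ymin"))
        xmax = float(box.findtext("xmax"))
        ymax = float(box.findtext("ymax"))
        # Corner coordinates in pixels -> normalized center and size.
        x_c = (xmin + xmax) / 2 / img_w
        y_c = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{class_ids[name]} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines


# Made-up annotation, only to show the expected structure.
SAMPLE_XML = """
<annotation>
  <size><width>200</width><height>100</height><depth>3</depth></size>
  <object>
    <name>person</name>
    <bndbox><xmin>50</xmin><ymin>20</ymin><xmax>150</xmax><ymax>60</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>10</xmin><ymin>10</ymin><xmax>30</xmax><ymax>30</ymax></bndbox>
  </object>
</annotation>
"""

if __name__ == "__main__":
    for line in voc_xml_to_yolo_lines(SAMPLE_XML, {"person": 0}):
        print(line)
```

Each image then gets a .txt file with these lines, named after the image file.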

Once you are OK with that, move on to training yolov8 on your own dataset.

P.S. There were ten more hiccups (some of them took an hour to Google), but I am not sure how many of those were caused by conflicting pytorch packages.
