
I want to train on a subset of the COCO dataset. For the images, I have created a folder containing the first 30k images of the train2017 folder. Now I need the annotations of those 30k images (extracted from instances_train2017.json) in a separate JSON file so that I can train on it.

How can I do it?

Capri

2 Answers


There is no simple way to do this, because the annotations for all of the images are stored in one long JSON file. I am working on a Python package that can help with dataset preparation tasks, including this one.
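For context, instances_train2017.json is a single JSON object whose top-level lists hold every image record and every annotation, with each annotation pointing back to its image via image_id. A quick way to inspect that layout (the path below is an assumption about where your annotations live):

import json

# Assumption: standard COCO layout, with the annotations folder next to the image folders.
with open("annotations/instances_train2017.json") as f:
    coco = json.load(f)

print(coco.keys())             # typically: info, licenses, images, annotations, categories
print(coco["images"][0])       # each image record has an "id" and a "file_name"
print(coco["annotations"][0])  # each annotation references its image via "image_id"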

I have created a reproducible example in this notebook: https://github.com/pylabel-project/samples/blob/main/coco_extract_subset.ipynb, which you can also open directly in Google Colab.

The package generally works like this:

from pylabel import importer

# path_to_annotations points at the source file, e.g. instances_train2017.json
dataset = importer.ImportCoco(path_to_annotations)

# The annotations are now stored in a dataframe that you can query and
# manipulate like any other pandas dataframe. Here we keep only the rows
# whose image filename is in the list `files`.
dataset.df = dataset.df[dataset.df.img_filename.isin(files)].reset_index()

# Write the filtered annotations back out in COCO format.
dataset.export.ExportToCoco()
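Here files is the list of image filenames you want to keep. For the 30k-image folder from the question, it could be built with something like this (the folder name is an assumption):

from pathlib import Path

# Assumption: this folder holds the first 30k jpg files copied from train2017.
files = [p.name for p in Path("train2017_subset").glob("*.jpg")]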

I hope it works for you. Please let me know if you have any feedback.

alexheat

A preliminary note:

COCO datasets are primarily JSON files containing paths to images and the annotations for those images. So, if you wish to split your dataset, you don't need to move your images into separate folders; instead, you should split the records contained in the JSON file. Doing this from scratch is not straightforward, because the records have internal dependencies within the JSON file. The good news is that there is a package named COCOHelper that can help you do that with very little effort!
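To see what those internal dependencies look like: a from-scratch filter has to keep the image records and the annotations that reference them (through image_id) in sync. A rough sketch, assuming the standard COCO file layout and a folder holding your image subset (this is plain Python, not part of COCOHelper):

import json
from pathlib import Path

# Assumptions: standard COCO annotation location and a folder with the subset of images.
src = Path("annotations/instances_train2017.json")
subset_dir = Path("train2017_subset")

coco = json.loads(src.read_text())
keep_names = {p.name for p in subset_dir.glob("*.jpg")}

# Keep only the image records whose file names are in the subset folder...
images = [img for img in coco["images"] if img["file_name"] in keep_names]
keep_ids = {img["id"] for img in images}

# ...and only the annotations that point at one of those images.
annotations = [a for a in coco["annotations"] if a["image_id"] in keep_ids]

# Categories and the other top-level keys are carried over unchanged.
subset = {**coco, "images": images, "annotations": annotations}
Path("annotations/instances_train2017_subset.json").write_text(json.dumps(subset))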

Quick Solution:

You can split a COCO dataset into subsets, each associated with its own annotations, using COCOHelper. It is as simple as:

from cocohelper import COCOHelper
from cocohelper.splitters.proportional import ProportionalDataSplitter

ch = COCOHelper.load_json(annotations_file, img_dir=image_dir)
splitter = ProportionalDataSplitter(70, 10, 20)  # split the dataset as 70-10-20% of the images
ch_train, ch_val, ch_test = splitter.apply(ch)
ch_train.write_annotations_file(fname)  # fname: path of the output JSON file

A fully working example:

Imports + set up paths:

from pathlib import Path
from cocohelper import COCOHelper
from cocohelper.splitters.proportional import ProportionalDataSplitter

root_dir = Path('/data/robotics/oil_line_detection')
annotations_dir = root_dir / 'annotations'
annotations_file = annotations_dir / 'coco.json'
image_dir = ""
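For the COCO setup from the question, those variables would instead point at the standard layout, roughly like this (the paths below are assumptions):

root_dir = Path('/data/coco')
annotations_file = root_dir / 'annotations' / 'instances_train2017.json'
image_dir = root_dir / 'train2017'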

Create a COCOHelper object, which represents your COCO dataset:

print(f"Loading dataset: {annotations_file}")
ch = COCOHelper.load_json(annotations_file, img_dir=image_dir)

Split the dataset (e.g. using a proportional data splitter, which splits data randomly):

splitter = ProportionalDataSplitter(70, 10, 20)
ch_train, ch_val, ch_test = splitter.apply(ch)

dest_dir = Path("./result")  # where to save the JSON files with the annotations of each subset
dest_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists

for ch_split, ch_name in zip([ch_train, ch_val, ch_test], ["train", "val", "test"]):
    print(f"Saving dataset: '{ch_name}'")
    fname = dest_dir / f"{ch_name}.json"
    ch_split.write_annotations_file(fname)

More examples and details here.

gab