Cross Validation with coco data format json files

Question

I am new to ML and am trying semantic image segmentation on Google Colab, using a COCO-format JSON file and a large number of images stored on Google Drive.

update

I borrowed this code as a starting point. So my code on colab is pretty much like this. https://github.com/akTwelve/tutorials/blob/master/mask_rcnn/MaskRCNN_TrainAndInference.ipynb

/update

Every time I receive new annotation data, I split the exported JSON file into two JSON files (train/validate, with an 80/20 ratio). This is getting tiring, since I have more than 1000 annotations in a file and I do the split manually with the replace function of VS Code.

Is there a better way to do this programmatically on Google Colab?

What I would like to do is rotate the annotation data without splitting the JSON file manually.

Say I have 1000 annotations in ONE JSON file on my Google Drive. For the first training session I would like to use annotations 1-800 for training and 801-1000 for validation; for the next session I would like to use annotations 201-1000 for training and 1-200 for validation. In other words, I want to select a part of the data in the JSON from code on Colab.

Or, if I could rotate the data during one training session (K-Fold Cross Validation?), that would be even better, but I have no idea how to do this.

Here are parts of my code on Colab.

Loading json files

dataset_train = CocoLikeDataset()
dataset_train.load_data('PATH_TO_TRAIN_JSON', 'PATH_TO_IMAGES')
dataset_train.prepare()

dataset_val = CocoLikeDataset()
dataset_val.load_data('PATH_TO_VALIDATE_JSON', 'PATH_TO_IMAGES')
dataset_val.prepare()

Initializing model

model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)

init_with = "coco"

if init_with == "imagenet":
    model.load_weights(model.get_imagenet_weights(), by_name=True)
elif init_with == "coco":
    model.load_weights(COCO_MODEL_PATH, by_name=True,
                       exclude=["mrcnn_class_logits", "mrcnn_bbox_fc", 
                                "mrcnn_bbox", "mrcnn_mask"])
elif init_with == "last":
    model.load_weights(model.find_last(), by_name=True)

train

start_train = time.time()
model.train(dataset_train, dataset_val, 
            learning_rate=config.LEARNING_RATE, 
            epochs=30, 
            layers='heads')
end_train = time.time()
minutes = round((end_train - start_train) / 60, 2)
print(f'Training took {minutes} minutes')

fine-tune (all layers)

start_train = time.time()
model.train(dataset_train, dataset_val, 
            learning_rate=config.LEARNING_RATE / 10,
            epochs=10, 
            layers="all")
end_train = time.time()
minutes = round((end_train - start_train) / 60, 2)
print(f'Training took {minutes} minutes')

json

{
  "info": {
    "year": 2020,
    "version": "1",
    "description": "Exported using VGG Image Annotator (http://www.robots.ox.ac.uk/~vgg/software/via/)",
    "contributor": "",
    "url": "http://www.robots.ox.ac.uk/~vgg/software/via/",
    "date_created": "Tue Jan 21 2020 16:18:14"
  },
  "images": [
    {
      "id": 0,
      "width": 2880,
      "height": 2160,
      "file_name": "sample01.jpg",
      "license": 1,
      "flickr_url": "sample01.jpg",
      "coco_url": "sample01.jpg",
      "date_captured": ""
    }
  ],
  "annotations": [
    {
      "id": 0,
      "image_id": "0",
      "category_id": 1,
      "segmentation": [
        588,
        783,
        595,
        844,
        607,
        687,
        620,
        703,
        595,
        722,
        582,
        761
      ],
      "area": 108199,
      "bbox": [
        582,
        687,
        287,
        377
      ],
      "iscrowd": 0
    }
  ],
  "licenses": [
    {
      "id": 1,
      "name": "Unknown",
      "url": ""
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "nail",
      "supercategory": "type"
    }
  ]
}

FYI, my workflow is:

Label images with VIA annotation tool
Export annotations in coco format json
Modify the json and save to my google drive
Load the json on colab and start training

It's going to be difficult to answer a question about splitting your data without the data itself or much information on it. If all you're asking is how to split the data, isn't most of that code irrelevant here? — AMC, Jan 22 '20 at 02:47
Thanks AMC, I just added a sample JSON and my workflow. But by assuming from your comment, data rotation is not something you do while training? — iiddaaa, Jan 22 '20 at 03:11
_But by assuming from your comment, data rotation is not something you do while training?_ How do you get that from my comment? I don't do any ML, so I have no idea. If you can provide a simple explanation of what that would involve, I can give it a try. — AMC, Jan 22 '20 at 03:13
It's good that you shared the data, but it isn't clear to me which parts you want to split. I would rather not have to run and reverse engineer the code myself. — AMC, Jan 22 '20 at 03:14

score 0 · Accepted Answer · answered Jan 22 '20 at 03:49

There's a very good utility function in the sklearn library for doing exactly what you want here. It's called train_test_split.

Now, it's hard to understand what your data structures are, but I am assuming that this code:

dataset_train = CocoLikeDataset()
dataset_train.load_data('PATH_TO_TRAIN_JSON', 'PATH_TO_IMAGES')
dataset_train.prepare()

populates dataset_train with some kind of array of images, or else an array of the paths to the images. sklearn's train_test_split function is able to accept pandas DataFrames as well as numpy arrays.

I am usually very comfortable with pandas DataFrames, so I would suggest you combine the training and validation data into one DataFrame using the pandas function concat, then create a random split using the sklearn function train_test_split at the beginning of every training epoch. It would look something like the following:

import pandas as pd
from sklearn.model_selection import train_test_split

# Convert the data into a DataFrame
# (this assumes each dataset object can be converted to tabular form)
master_df = pd.concat([pd.DataFrame(dataset_train), pd.DataFrame(dataset_val)], ignore_index=True)

# Separate out the data and targets DataFrames (required by train_test_split).
# The column names below are placeholders; substitute the columns your data actually has.
data_df = master_df[['image_data_col_1','image_data_col_2','image_data_col_3']]
targets_df = master_df[['class_label']]

# Split the data into a random train/test (or train/val) split
data_train, data_val, targets_train, targets_val = train_test_split(data_df, targets_df, test_size=0.2)

# Training loop
# If the training function requires the targets to be present in the same DataFrame, you can do this before beginning training:
dataset_train_df = pd.concat([data_train, targets_train], axis=1)
dataset_val_df = pd.concat([data_val, targets_val], axis=1)
##################################
# Continue with training loop...
##################################
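
If the dataset object does not convert cleanly to a DataFrame, the same split can probably be applied to the COCO JSON itself. The following is only a minimal sketch under a few assumptions: the file follows the layout shown in the question, the split is done per image (so all annotations of one image stay on the same side), and the helper names `subset_coco`, `split_coco` and `write_split` are made up for illustration. Ids are compared as strings because the sample JSON stores the image id as 0 in "images" but as "0" in "annotations".

```python
import json
from sklearn.model_selection import train_test_split


def subset_coco(coco, image_ids):
    """Return a copy of a COCO-style dict restricted to the given image ids."""
    keep = {str(i) for i in image_ids}  # string comparison: ids may be 0 or "0"
    subset = dict(coco)  # shallow copy keeps info, licenses and categories
    subset["images"] = [im for im in coco["images"] if str(im["id"]) in keep]
    subset["annotations"] = [
        ann for ann in coco["annotations"] if str(ann["image_id"]) in keep
    ]
    return subset


def split_coco(coco, val_size=0.2, seed=None):
    """Randomly split a COCO-style dict into (train, val) dicts by image."""
    image_ids = [im["id"] for im in coco["images"]]
    train_ids, val_ids = train_test_split(
        image_ids, test_size=val_size, random_state=seed
    )
    return subset_coco(coco, train_ids), subset_coco(coco, val_ids)


def write_split(src_path, train_path, val_path, val_size=0.2, seed=None):
    """Read one COCO JSON file and write a train file and a val file."""
    with open(src_path) as f:
        coco = json.load(f)
    train, val = split_coco(coco, val_size, seed)
    for path, part in ((train_path, train), (val_path, val)):
        with open(path, "w") as f:
            json.dump(part, f)


# Demo on a tiny synthetic dataset: 10 images with 2 annotations each.
demo = {
    "info": {},
    "licenses": [],
    "categories": [{"id": 1, "name": "nail", "supercategory": "type"}],
    "images": [{"id": i, "file_name": f"sample{i:02d}.jpg"} for i in range(10)],
    "annotations": [
        {"id": i, "image_id": str(i // 2), "category_id": 1} for i in range(20)
    ],
}
train, val = split_coco(demo, seed=0)
print(len(train["images"]), "train images,", len(val["images"]), "val images")
```

The two files written by `write_split` could then be passed to the `load_data` calls from the question in place of the hand-edited JSON files, and a different `seed` per session would give a different split.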

Just one last note: ideally, you should have three sets - train, test, and validation. So separate out a testing set beforehand, and then do the train_test_split at the beginning of every iteration of the training loop to obtain your train-validation split from the remaining data.
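
As a rough illustration of that note, the sketch below holds out a test set once and then rotates the validation part over the remaining data with scikit-learn's `KFold`. The list of 1000 integer ids is a stand-in for the image ids read from the JSON, and the 10% test share and 5 folds are arbitrary choices, not values from the question.

```python
from sklearn.model_selection import KFold, train_test_split

# Stand-in for the real ids, e.g. [im["id"] for im in coco["images"]].
image_ids = list(range(1000))

# Hold out a test set once; it is never used for training or validation.
trainval_ids, test_ids = train_test_split(
    image_ids, test_size=0.1, random_state=0
)

# Rotate the validation part over the remaining ids.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
folds = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(trainval_ids)):
    # KFold yields positions into trainval_ids, not the ids themselves.
    train_ids = [trainval_ids[i] for i in train_idx]
    val_ids = [trainval_ids[i] for i in val_idx]
    folds.append((train_ids, val_ids))
    print(f"fold {fold}: {len(train_ids)} train, {len(val_ids)} val")
    # Here one would build the train and val datasets for this fold
    # and train a freshly initialized model on them.
```

Every id outside the test set lands in the validation part of exactly one fold, which is the rotation described in the question. Training one model per fold multiplies the training time by the number of folds, so on Colab a small number of folds may be more practical.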

Thanks @shinvu! I checked the train_test_split link. Handling a pandas DataFrame with the scikit-learn library seems very promising. But now I am wondering how to integrate them into my current code. My ML knowledge is still very fragmentary. My code for training is almost identical to [this](https://github.com/akTwelve/tutorials/blob/master/mask_rcnn/MaskRCNN_TrainAndInference.ipynb). Or if you could point me to documentation on that kind of topic, that would be appreciated. — iiddaaa, Jan 27 '20 at 04:16
