
I've produced a model using Azure Custom Vision and exported it as "Tensorflow - SavedModel". The model is being used locally with the helper code that was included in the export, slightly modified to read frames from a live video feed using OpenCV's VideoCapture()/read() in a loop.

My application is making good detections against the live video feed, as I can see the results output correctly to the console, but I'm having trouble getting accurate bounding boxes to display on the output video stream. The console output shows the array of results from the Azure Custom Vision model's predictions, and I can see the bounding box coordinates, which appear to be normalized values.

Prior to using Azure Custom Vision, I was able to use existing models from the "Model Zoo", and the Object Detection API Python visualization helpers displayed bounding boxes correctly on the feed.

However, the coordinates returned from Azure Custom Vision appear to be "different" from the ones returned by the default COCO SSD models.

I need to convert the bounding box coordinates returned from Azure Custom Vision to values understood by the Tensorflow Object Detection API visualization helpers.

Original Code using Object Detection API and COCO SSD Model (works!):

    output_dict = run_inference_for_single_image(image_resize, graph)

    # Visualization of the results of a detection.
    vis_util.visualize_boxes_and_labels_on_image_array(
        image_np,
        output_dict['detection_boxes'],
        output_dict['detection_classes'],
        output_dict['detection_scores'],
        category_index,
        instance_masks=output_dict.get('detection_masks'),
        use_normalized_coordinates=True,
        line_thickness=2)
    cv2.imshow('object_detection', cv2.resize(image_np, (640, 480)))

Azure Custom Vision version not displaying boxes correctly:

    image_resized = cv2.resize(image_np, (512, 512))

    predictions = od_model.predict_image(Image.fromarray(image_resized))

    if len(predictions) > 0:
        print(predictions)                    
        output_dict = {}
        output_dict['detection_boxes'] = [???]  # <-- Populate with a compatible shape... but how?!?
        output_dict['detection_scores'] = np.asarray([ sub['probability'] for sub in predictions ])                
        output_dict['detection_classes'] = np.asarray([ sub['tagId'] for sub in predictions ])
        output_dict['detection_class_names'] = np.asarray([ sub['tagName'] for sub in predictions ])
        vis_util.visualize_boxes_and_labels_on_image_array(image_np,
              output_dict['detection_boxes'],
              output_dict['detection_classes'],
              output_dict['detection_scores'],
              category_index,
              instance_masks=output_dict.get('detection_masks'),
              use_normalized_coordinates=True,
              line_thickness=2)

Console output showing the Custom Vision model response:
         {'probability': 0.95146583, 'tagId': 1, 'tagName': 'MyObject', 'boundingBox': {'left': 0.11083871, 'top': 0.65143364, 'width': 0.05332406, 'height': 0.04930339}}
         {'probability': 0.92589812, 'tagId': 0, 'tagName': 'OtherObject', 'boundingBox': {'left': 0.24750886, 'top': 0.68784532, 'width': 0.54308632, 'height': 0.17839652}}

Using the Azure Custom Vision model, I can't seem to get bounding boxes to display correctly. I was able to convert the Custom Vision "boundingBox" to the same shape the visualization expects, but the box was never at the correct coordinates. I thought it might be because the coordinate systems of the COCO SSD output tensor and the Custom Vision prediction response are calculated differently, or maybe the coordinates are in a different order between the two shapes?

Has anyone already solved this translation? Am I doing it wrong? Thanks in Advance!

1 Answer


Well, answering my own question... I was able to get this working by brute force, and now all of my Custom Vision responses translate to the COCO SSD Tensorflow Object Detection API output. It boils down to two things:

  1. Create a compatible .pbtxt label map matching the order of the labels .txt file from the Custom Vision Portal 'Export' (I'm using the SavedModel export). Note that .pbtxt label maps use 1-based ids, while the tagId returned by Custom Vision models is zero-based. This was easily solved by adding "+1" to the tagId when assigning to the 'detection_classes' key in the dictionary and letting the TF OD API do the rest by assigning human-readable labels to the bounding boxes on the live feed (see the example label map right after this list).

  2. Bounding box coordinate kung-fu...! After I shifted the results from the Custom Vision API into an 'array of arrays' matching the original TF model output and calculated ymax and xmax, I found the Object Detection API visualization bits expect each box as [ymin, xmin, ymax, xmax], in that order (don't try, e.g., [xmin, ymin, xmax, ymax] in your conversion). The Custom Vision API response returns the left (xmin), top (ymin), width, and height of its bounding box, all normalized. I've seen this called out in several answers on here, but usually in the context of something else. The ImageNet/ResNet models return different shapes by default as well; I'm not answering that one here because I have no need for it currently, but I'm sure a similar brute-force approach would work with that model if needed.
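For reference, here is a minimal label map sketch based on the two tags in the console output above (the file name and ordering are assumptions; match the order of your own exported labels .txt, shifted to 1-based ids):

    # labels.pbtxt -- hypothetical label map built from the console output above.
    # Custom Vision tagId 0 ('OtherObject') -> id 1, tagId 1 ('MyObject') -> id 2.
    item {
      id: 1
      name: 'OtherObject'
    }
    item {
      id: 2
      name: 'MyObject'
    }

Load it into category_index the usual way (e.g. with the TF OD API's label_map_util helpers) and the visualization call below picks up the human-readable names.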

Anyway... some code...

        ret, image_np = cap.read()
        if image_np is None:
            continue

        predictions = od_model.predict_image(Image.fromarray(image_np))
        
        if len(predictions) > 0:                
            output_dict = {}            
            output_dict['detection_boxes'] = []

            output_dict['detection_scores'] = np.asarray([ sub['probability'] for sub in predictions ])                
            output_dict['detection_classes'] = np.asarray([ sub['tagId'] for sub in predictions ]) + 1  # label map ids are 1-based; Custom Vision tagIds are 0-based
            output_dict['detection_class_names'] = np.asarray([ sub['tagName'] for sub in predictions ])

            for p in predictions:
                print(p)        #for debugging purposes...            
                box_left = p['boundingBox']['left']
                box_top = p['boundingBox']['top']
                box_height = p['boundingBox']['height'] 
                box_width = p['boundingBox']['width']
                output_dict['detection_boxes'].append(np.asarray(( box_top, box_left, box_top+box_height, box_left + box_width)))
                
            output_dict['detection_boxes'] = np.asarray(output_dict['detection_boxes'])

            vis_util.visualize_boxes_and_labels_on_image_array(
                    image_np,
                    output_dict['detection_boxes'],
                    output_dict['detection_classes'],
                    output_dict['detection_scores'],
                    category_index,
                    min_score_thresh=0.949,                        
                    instance_masks=output_dict.get('detection_masks'),
                    use_normalized_coordinates=True,
                    line_thickness=2)
        cv2.imshow('object_detection', cv2.resize(image_np, (VID_W, VID_H)))
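
For clarity, here's the box math pulled out into a small standalone helper (the function name is just illustrative), sanity-checked against the first console response from the question:

    import numpy as np

    def custom_vision_box_to_tf(bounding_box):
        # Custom Vision returns normalized left/top/width/height;
        # the TF OD API visualization helpers expect normalized [ymin, xmin, ymax, xmax].
        left = bounding_box['left']
        top = bounding_box['top']
        return np.asarray((top,
                           left,
                           top + bounding_box['height'],
                           left + bounding_box['width']))

    # {'left': 0.11083871, 'top': 0.65143364, 'width': 0.05332406, 'height': 0.04930339}
    # -> [0.65143364, 0.11083871, 0.70073703, 0.16416277]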