How to speed up Caffe for sliding window object detection in a test image

Question

I have trained a convolutional neural network (CNN) to determine/detect if an object of interest is present or not in a given image patch.

Now, given a large image, I am trying to locate all occurrences of the object in a sliding window fashion by applying my CNN model to the patch surrounding each pixel in the image. However, this is very slow.

My test images are 512 x 512. For my Caffe net, the test batch size is 1024 and the patch size is 65 x 65 x 1.

I tried to apply my caffe net on a batch of patches (size = test_batch_size) instead of a single patch at a time. Even then it is slow.

Below is my current solution, which is very slow. I would appreciate any suggestions for speeding this up other than down-sampling my test image.

Current solution that is very slow:

import time

import matplotlib.pyplot as plt
import numpy as np
import skimage.io


def detectObjects(net, input_file, output_file):

    # read input image
    inputImage = plt.imread(input_file)

    # get test_batch_size and patch_size used for cnn net
    test_batch_size = net.blobs['data'].data.shape[0]
    patch_size = net.blobs['data'].data.shape[2]

    # collect all patches
    w = patch_size // 2  # half-width of a patch (np.int was removed in NumPy 1.24)

    num_patches = (inputImage.shape[0] - patch_size) * \
                  (inputImage.shape[1] - patch_size)

    patches = np.zeros((patch_size, patch_size, num_patches))
    patch_indices = np.zeros((num_patches, 2), dtype='int64')

    count = 0

    for i in range(w + 1, inputImage.shape[0] - w):
        for j in range(w + 1, inputImage.shape[1] - w):

            # store patch center index
            patch_indices[count, :] = [i, j]

            # store patch
            patches[:, :, count] = \
                inputImage[(i - w):(i + w + 1), (j - w):(j + w + 1)]

            count += 1

    print("Extracted %s patches" % num_patches)

    # Classify patches using cnn and write result to output image
    outputImage = np.zeros_like(inputImage)
    outputImageFlat = np.ravel(outputImage)

    pad_w = test_batch_size - num_patches % test_batch_size
    patches = np.pad(patches, ((0, 0), (0, 0), (0, pad_w)),
                     'constant')
    patch_indices = np.pad(patch_indices, ((0, pad_w), (0, 0)),
                           'constant')

    start_time = time.time()

    for i in range(0, num_patches, test_batch_size):

        # get current batch of patches
        cur_pind = patch_indices[i:i + test_batch_size, :]

        cur_patches = patches[:, :, i:i + test_batch_size]
        cur_patches = np.expand_dims(cur_patches, 0)
        cur_patches = np.rollaxis(cur_patches, 3)

        # apply cnn on current batch of images
        net.blobs['data'].data[...] = cur_patches

        output = net.forward()

        prob_obj = output['prob'][:, 1]

        if i + test_batch_size > num_patches:

            # remove padded part
            num_valid = num_patches - i
            prob_obj = prob_obj[0:num_valid]
            cur_pind = cur_pind[0:num_valid, :]

        # set output
        cur_pind_lin = np.ravel_multi_index((cur_pind[:, 0],
                                             cur_pind[:, 1]),
                                             outputImage.shape)

        outputImageFlat[cur_pind_lin] = prob_obj

    end_time = time.time()
    print('Took %s seconds' % (end_time - start_time))

    # Save output
    skimage.io.imsave(output_file, outputImage * 255.0)

I was hoping that with the lines

    net.blobs['data'].data[...] = cur_patches
    output = net.forward()

Caffe would classify all the patches in cur_patches in parallel on the GPU. I wonder why it is still slow.

@Shai I am using a CNN. I figured out the problem. I was using net_test.prototxt instead of net_deploy.prototxt, which led to some funky behavior. I tuned the batch size in deploy mode, and with a batch size of 1000 I am able to do a dense classification of all the patches (~200000) in a 512 x 512 image in 9 seconds, which I am happy with for now. Thanks a lot for your help with the generation of net_deploy.prototxt. - cdeepakroy, Dec 08 '16 at 23:27
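
For reference, the figure of ~200000 patches is consistent with the image and patch sizes in the question: a 65 x 65 patch fits at 512 - 65 + 1 = 448 positions per axis, so 448 x 448 = 200704 patches in total (the loop in the question starts at w + 1 and therefore visits 447 x 447 = 199809 of them). The sketch below shows one possible way to generate the batches lazily instead of materializing every patch up front. It is only an illustration: `iter_patch_batches` is a hypothetical helper, not part of Caffe, and the demo uses a small 100 x 100 image to keep it quick.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def iter_patch_batches(image, patch_size, batch_size):
    """Yield (centers, batch) pairs for every valid patch position.

    Hypothetical helper: batch has shape (n, 1, patch_size, patch_size),
    matching the N x C x H x W layout of a single-channel 'data' blob.
    """
    # A strided view over all windows; no pixel data is copied here.
    windows = sliding_window_view(image, (patch_size, patch_size))
    n_rows, n_cols = windows.shape[:2]
    half = patch_size // 2
    total = n_rows * n_cols
    for start in range(0, total, batch_size):
        idx = np.arange(start, min(start + batch_size, total))
        rows, cols = np.unravel_index(idx, (n_rows, n_cols))
        # Fancy indexing copies only the patches of the current batch.
        batch = windows[rows, cols][:, np.newaxis, :, :]
        centers = np.stack([rows + half, cols + half], axis=1)
        yield centers, batch


# Small demo image (assumption: single-channel float image, as in the question).
image = np.arange(100 * 100, dtype=np.float32).reshape(100, 100)
batches = list(iter_patch_batches(image, patch_size=65, batch_size=256))
n_patches = sum(len(centers) for centers, _ in batches)
print(n_patches, batches[0][1].shape)
```

Each batch could then be copied into net.blobs['data'].data as in the question. Note that the last batch may hold fewer patches than batch_size, so it would need padding (as the question's code does) or a reshaped input blob.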

score 1 · Answer 1 · edited May 23 '17 at 12:08

I think what you are looking for is described in the section Casting a Classifier into a Fully Convolutional Network of the "net surgery" tutorial.
What this solution basically says is that, instead of conv layers followed by an "InnerProduct" layer for classification, the "InnerProduct" layer can be transformed into an equivalent conv layer, resulting in a fully convolutional net that can process images of any size and output a prediction map whose size depends on the input size.
Moving to a fully convolutional architecture will significantly reduce the number of redundant computations you are currently making (overlapping patches share most of their pixels, yet each patch is processed from scratch), and should significantly speed up your process.
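
To make the equivalence concrete, here is a minimal NumPy sketch (not Caffe code, and not the tutorial's implementation). It assumes a toy "network" consisting of a single fully connected layer applied to a k x k patch, with random weights and a small image chosen purely for illustration. Reshaping the weight matrix into k x k kernels and sliding them over the whole image gives the same scores as classifying every patch separately:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
k, n_out, size = 5, 2, 12          # toy patch size, class count, image size
image = rng.standard_normal((size, size))
fc_weights = rng.standard_normal((n_out, k * k))   # stand-in "InnerProduct" weights
fc_bias = rng.standard_normal(n_out)

# Patch-by-patch: flatten every k x k window and apply the fully connected layer.
n_pos = size - k + 1
patch_scores = np.empty((n_out, n_pos, n_pos))
for i in range(n_pos):
    for j in range(n_pos):
        patch = image[i:i + k, j:j + k].ravel()
        patch_scores[:, i, j] = fc_weights @ patch + fc_bias

# Convolutional: reinterpret the same weights as n_out kernels of size k x k
# and correlate them with the whole image in one pass.
kernels = fc_weights.reshape(n_out, k, k)
windows = sliding_window_view(image, (k, k))       # shape (n_pos, n_pos, k, k)
conv_scores = np.einsum('ijkl,okl->oij', windows, kernels) + fc_bias[:, None, None]

print(np.allclose(patch_scores, conv_scores))
```

In a real net the larger saving comes from the earlier conv layers, whose outputs are shared between overlapping patches once the whole image is processed in a single forward pass.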

Another possible direction for speedup is to approximate high-dimensional "InnerProduct" layers by a product of two lower-rank matrices using the truncated SVD trick.
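
A rough sketch of the idea in NumPy, under stated assumptions: the weight matrix below is synthetic and deliberately close to low rank, and `truncated_svd_factors` is a hypothetical helper name. How much a trained layer can be compressed without hurting accuracy has to be checked empirically.

```python
import numpy as np


def truncated_svd_factors(weights, rank):
    """Return A (n_out x rank) and B (rank x n_in) with A @ B approximating weights."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # fold the singular values into the left factor
    b = vt[:rank, :]
    return a, b


rng = np.random.default_rng(0)
n_out, n_in, rank = 256, 1024, 16

# Synthetic weights: a rank-16 matrix plus a little noise (illustration only).
weights = rng.standard_normal((n_out, rank)) @ rng.standard_normal((rank, n_in))
weights += 0.01 * rng.standard_normal((n_out, n_in))

a, b = truncated_svd_factors(weights, rank)

x = rng.standard_normal(n_in)
y_full = weights @ x               # one large layer
y_low = a @ (b @ x)                # two thin layers applied in sequence

rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
params_full = weights.size
params_low = a.size + b.size
print(params_full, params_low, rel_err)
```

In network terms, one "InnerProduct" layer with n_out x n_in weights is replaced by two consecutive "InnerProduct" layers, the first with rank outputs and no bias, the second carrying the original bias.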

score 0 · Answer 2 · answered Dec 14 '22 at 15:40

If you still use Caffe, I'd recommend trying OpenVINO to decrease inference time. OpenVINO optimizes your model by converting it to an Intermediate Representation (IR), pruning the graph, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime. OpenVINO is optimized for Intel hardware, but it should work with any CPU.

Instructions for using it are below.

Install OpenVINO

The easiest way to do it is with pip. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[caffe]

Use Model Optimizer to convert Caffe model

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the Caffe model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:

mo --input_model "model.caffemodel" --data_type FP32 --source_layout "[n,c,h,w]" --target_layout "[n,h,w,c]" --output_dir "model_ir"

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. It seems you care about throughput, so I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement.

import cv2
import numpy as np
from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT":"CUMULATIVE_THROUGHPUT"}) # alternatively THROUGHPUT 

# Get input and output layers
input_layer_ir = compiled_model_ir.input(0)
output_layer_ir = compiled_model_ir.output(0)

# Resize and reshape input image
# (input_image is assumed to be already loaded as a NumPy array, e.g. with cv2.imread)
height, width = list(input_layer_ir.shape)[1:3]
input_image = cv2.resize(input_image, (width, height))[np.newaxis, ...]

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]

Disclaimer: I work on OpenVINO.
