I am confused about the difference between Keras Applications (VGG16, Xception, ResNet50, etc.) and models such as R-CNN and Faster R-CNN. In some places it is said that ResNet50 is just a feature extractor, while Faster R-CNN/R-CNN, YOLO and SSD are more like a "pipeline" (see "What is the difference between Resnet 50 and yolo or rcnn?"). Yet the Keras website refers to ResNet50, VGG16, Xception, etc. as deep learning models: https://keras.rstudio.com/articles/applications.html. Can anyone explain the difference between these in the simplest terms?
I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Oct 07 '21 at 12:06
1 Answer
You may need some background in image processing and computer vision in order to understand what each definition means.
In short,
VGG16, ResNet-50, and others are deep architectures of convolutional neural networks for images.
Such architectures are usually trained to classify an image into a category, out of 1000 possible categories (look up the ImageNet CLS-LOC challenge for more information about the categories).
Such architectures usually "consume" an RGB image of size, say, 224x224 and use convolutional layers to extract visual features from it at 5 different scales (you may need computer vision / machine learning background to understand this sentence). The image width and height are halved 5 times on the way through the network, so that at the end of the network they are 32x smaller than in the original image, i.e., 7x7 in our case (note that 2^5 = 32).
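To make the arithmetic concrete, here is a minimal sketch of how the spatial size shrinks stage by stage. The function name `feature_map_sizes` is made up for illustration, and it assumes every stage halves the width and height exactly, as described above.

```python
def feature_map_sizes(input_size=224, num_stages=5):
    """Return the feature-map side length after each 2x downsampling stage."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= 2  # each stage halves width and height
        sizes.append(size)
    return sizes


sizes = feature_map_sizes(224)
print(sizes)             # [112, 56, 28, 14, 7]
print(224 // sizes[-1])  # 32, i.e. 2**5
```

The last three entries (28, 14, 7) are the feature-map sizes that matter most for the detectors discussed below.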
In the original classification network, this final output (7x7 in our example) is then aggregated and fed to a final classifier that predicts a score for each of the 1000 classes.
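As a rough illustration of that "aggregate, then classify" step, here is a NumPy sketch. It assumes the ResNet-50 flavour of aggregation (global average pooling over the 7x7 grid, then one linear layer); VGG16 instead flattens the map into fully connected layers. The feature map and weights are random placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

num_channels = 2048  # channel count of ResNet-50's last feature map
num_classes = 1000   # ImageNet classification categories

# Placeholder for the 7x7 feature map a real backbone would produce.
feature_map = rng.standard_normal((7, 7, num_channels))

# Aggregate: average over the 7x7 spatial grid -> one vector per image.
pooled = feature_map.mean(axis=(0, 1))  # shape (2048,)

# Final classifier: a linear layer followed by softmax (random weights here).
weights = rng.standard_normal((num_channels, num_classes)) * 0.01
bias = np.zeros(num_classes)
logits = pooled @ weights + bias

exp = np.exp(logits - logits.max())  # subtract max for numerical stability
probs = exp / exp.sum()              # one score per class, summing to 1

print(pooled.shape, probs.shape)
```

Note that the pooling throws away the "where" information in the 7x7 grid, which is fine for classification but is exactly what a detector wants to keep.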
However, back to your question, object detectors exploit the fact that there are "deep feature maps" of size 7x7 (and 14x14, 28x28 in the earlier layers) to apply different "heads", which are trained to do other tasks apart from classification, usually localized tasks, since the feature maps give you localized information. Such tasks include object detection, instance segmentation, and others.
Faster R-CNN, YOLO and SSD are all examples of such object detectors, which can be built on top of any deep architecture (usually called the "backbone" in this context). For example, you can have a ResNet-50-based SSD object detector and a VGG-16-based SSD object detector. The better the backbone, the better the detector usually performs, as it can use better visual features for the task it is trained to do. R-CNN is a much older approach that is far slower and less accurate than modern object detectors, which are trained end to end.
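The backbone/head split can be sketched in a few lines of NumPy. This is only a toy under stated assumptions, not how any of these detectors is actually implemented: `fake_backbone` and `detection_head` are hypothetical names, the backbone returns random features of the right shape instead of running a real network, and the head is a simplified single-scale, SSD/YOLO-style dense predictor with made-up anchor and class counts.

```python
import numpy as np

rng = np.random.default_rng(0)


def fake_backbone(image, out_channels, stride=32):
    """Stand-in for a CNN backbone: random features with the right shape."""
    height, width, _ = image.shape
    return rng.standard_normal((height // stride, width // stride, out_channels))


def detection_head(features, num_anchors=3, num_classes=20):
    """Toy 1x1-conv-style head: per grid cell and anchor, predict
    4 box offsets plus one score per class (weights are random)."""
    height, width, channels = features.shape
    out_dim = num_anchors * (4 + num_classes)
    weights = rng.standard_normal((channels, out_dim)) * 0.01
    preds = features @ weights  # applied independently at every grid cell
    return preds.reshape(height, width, num_anchors, 4 + num_classes)


image = np.zeros((224, 224, 3))

# The same head works on top of different backbones; only the channel
# count of the incoming feature map changes.
for name, channels in [("VGG-16", 512), ("ResNet-50", 2048)]:
    feats = fake_backbone(image, channels)
    preds = detection_head(feats)
    print(name, feats.shape, preds.shape)
```

Unlike the classifier, the head keeps the 7x7 grid, so each cell can report objects near its own location.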
