I'm planning to use YOLO for a supervised CNN regression task: given an image, predict the number of times it will be viewed. I'm inclined to use YOLO because it is an object detector, and highly viewed photos mostly contain objects (faces, animals, text, etc.), many of which overlap with classes in the COCO dataset, on which the commonly released YOLO weights are trained.
I have already tried pretrained CNN models (VGGNet, MobileNet, etc.) with frozen weights, but the results were not good. Fine-tuning the pretrained models is not an option, since I don't have the computational resources to train on 100K+ images for multiple epochs just to get a good model for my problem.
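
To make the frozen-weights setup concrete, here is a minimal sketch of what it amounts to: with the backbone frozen, each image's features can be computed once and cached, and only a small regression head is trained on them. Everything specific here is an assumption for illustration, not my exact code: the random arrays stand in for cached backbone features, the view counts are synthetic, and the ridge head and `log1p` target (along with the helper names `fit_ridge_head` and `predict_views`) are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for features cached from a frozen pretrained CNN.
# Shapes are hypothetical; real features would come from the backbone.
n_images, n_features = 200, 32
features = rng.normal(size=(n_images, n_features))

# Synthetic view counts, for illustration only.
true_w = rng.normal(size=n_features)
log_views = np.clip(features @ true_w * 0.3 + 3.0, 0.0, 12.0)
views = np.expm1(log_views)


def fit_ridge_head(x, y, alpha=1.0):
    """Closed-form ridge regression; the head is the only trained part."""
    x_b = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    reg = alpha * np.eye(x_b.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(x_b.T @ x_b + reg, x_b.T @ y)


def predict_views(x, w):
    """Predict in log space, then map back to raw view counts."""
    x_b = np.hstack([x, np.ones((x.shape[0], 1))])
    return np.expm1(x_b @ w)


# Assumption: regress on log1p(views), since view counts tend to be heavy-tailed.
target = np.log1p(views)
weights = fit_ridge_head(features[:150], target[:150])
pred = predict_views(features[150:], weights)
```

Because the backbone is never updated, the expensive forward pass over the 100K+ images happens only once; after that, training the head is cheap. The same pattern would apply to features taken from a YOLO backbone instead of VGGNet or MobileNet.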