I am very new to the field of deep learning. While I understand how it works, and I have managed to run some tutorials with the Caffe library, I still have some questions to which I was unable to find satisfying answers.
My questions are as follows:
Consider AlexNet, which takes a 227 x 227 image as input in Caffe (I think in the original paper it is 224), and whose FC7 layer produces a 4096-D feature vector. Now, if I want to detect a person using a sliding window of size 32 x 64, each of these windows will be upscaled to 227 x 227 before going through AlexNet. That is a lot of computation. Is there a better way to handle this 32 x 64 window?
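To make the computation concrete, this is roughly the naive pipeline I have in mind (a minimal pycaffe sketch, assuming a standard BVLC AlexNet deploy definition; the file names are placeholders, and I have left out mean subtraction for brevity):

```python
import caffe
import cv2
import numpy as np

# Placeholder paths: a standard AlexNet deploy prototxt and pretrained weights.
net = caffe.Net('alexnet_deploy.prototxt', 'bvlc_alexnet.caffemodel', caffe.TEST)
net.blobs['data'].reshape(1, 3, 227, 227)

def fc7_feature(window):
    """Upscale one 32 x 64 BGR window to 227 x 227 and return its FC7 feature."""
    resized = cv2.resize(window, (227, 227))                  # H x W x 3
    net.blobs['data'].data[...] = resized.transpose(2, 0, 1)  # to 1 x 3 x 227 x 227
    net.forward()
    return net.blobs['fc7'].data[0].copy()                    # 4096-D vector
```

Running the whole network once per window like this is exactly the cost I would like to avoid.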
My approach to this 32 x 64 window detector is to build my own network with a few convolutions, poolings, ReLUs, and FC layers. While I understand how to build the architecture, I am afraid the model I train might have issues such as overfitting. A friend of mine told me to pretrain my network using AlexNet, but I don't know how to do this, and I cannot get hold of him to ask right now. Does anyone think that what he suggested is doable? I am confused. I was also thinking of training my network (which takes a 32 x 64 input) on ImageNet; since this is just a feature extractor, I feel that ImageNet would provide enough variety of images for good learning. Please correct me if I am wrong, and if possible guide me onto the correct path.
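If I have understood the Caffe fine-tuning examples correctly, the recipe would be: keep the names of the layers whose weights I want to reuse from AlexNet, rename the layers whose shapes change (the FC layers, since my input is 32 x 64 rather than 227 x 227), and copy the pretrained weights into the net before solving. A rough sketch, assuming a hypothetical my_solver.prototxt that points at my own train prototxt:

```python
import caffe

caffe.set_mode_gpu()

# Layers in my train prototxt that share a name (and weight shape) with
# AlexNet, e.g. conv1, get initialized from the snapshot; renamed layers,
# e.g. fc6_person, keep their random initialization and train from scratch.
solver = caffe.SGDSolver('my_solver.prototxt')
solver.net.copy_from('bvlc_alexnet.caffemodel')
solver.solve()
```

Is this what my friend meant, or is there more to it?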
This question is specifically about Caffe. Say I compute features using HOG, and I want to use the GPU version of the neural net to train a classifier on them. Is that possible? I was thinking of using an HDF5 layer to read the HOG feature vectors and pass them to fully connected layers for training. Is that possible?
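In other words, something like the following h5py sketch (the feature dimension, file names, and label values are placeholders). As far as I can tell, the HDF5Data layer takes a text file listing the .h5 files, and the dataset names inside each file have to match the layer's top blob names:

```python
import h5py
import numpy as np

# Placeholder data: N HOG vectors of dimension D, with binary labels.
N, D = 1000, 3780
hog_features = np.random.randn(N, D).astype(np.float32)
labels = np.random.randint(0, 2, size=(N, 1)).astype(np.float32)

# Dataset names 'data' and 'label' must match the HDF5Data layer's top blobs;
# Caffe expects float32 for both.
with h5py.File('hog_train.h5', 'w') as f:
    f.create_dataset('data', data=hog_features)
    f.create_dataset('label', data=labels)

# The layer's 'source' field points to a text file listing one .h5 per line.
with open('hog_train.txt', 'w') as f:
    f.write('hog_train.h5\n')
```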
I would appreciate any help, or links to papers etc. that may help me understand the ideas behind ConvNets.