Food101 SqueezeNet Caffe2 number of iterations

Question

I am trying to classify the ETH Food-101 dataset using SqueezeNet in Caffe2. My model is imported from the Model Zoo, and I made two modifications to it:

1) Changing the dimensions of the last layer to have 101 outputs

2) The images from the database are in NHWC form and I just flipped the dimensions of the weights to match. (I plan on changing this)

The Food101 dataset has 75,000 images for training and I am currently using a batch size of 128 and a starting learning rate of -0.01 with a gamma of 0.999 and stepsize of 1. What I noticed is that for the first 2000 iterations of the network the accuracy hovered around 1/128 and this took an hour or so to complete.
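To put those hyperparameters in perspective, here is a rough sketch of the arithmetic. It assumes the learning rate follows a "step" policy of the form base_lr * gamma^floor(iter / stepsize), and it works with the magnitude of the negative rate. The helper names `step_lr` and `iters_per_epoch` are made up for illustration.

```python
import math

def step_lr(base_lr, gamma, stepsize, iteration):
    """Learning rate under a step policy: base_lr * gamma ** floor(iteration / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)

def iters_per_epoch(num_images, batch_size):
    """Number of batches needed to see every training image once."""
    return int(math.ceil(num_images / float(batch_size)))

base_lr = 0.01   # magnitude of the -0.01 from the question
gamma = 0.999
stepsize = 1

# How far the rate has decayed at a few points in training.
for it in (0, 500, 1000, 2000):
    print("%5d  %.6f" % (it, step_lr(base_lr, gamma, stepsize, it)))

# Batches per pass over the training set, using the figures from the question.
print(iters_per_epoch(75000, 128))

# Accuracy expected from uniform random guessing over 101 classes.
print(1.0 / 101)
```

Under these assumptions the rate has already shrunk to roughly 14% of its starting value by iteration 2000, which is only about 3.4 passes over the data.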

I added all the weights to model.params (everything except data) so they get updated during gradient descent, and reinitialized all weights with Xavier and all biases to a constant. I would expect the accuracy to grow fairly quickly in the first hundred to thousand iterations and then tail off as the number of iterations grows. In my case, the accuracy stays flat near 0.

When I look at the gradient file I find that the average is on the order of 10^-6 with a standard deviation of 10^-7. This explains why learning is so slow, but I haven't been able to get the gradient to start much higher.

These are the gradient statistics for the first convolution after a few iterations

Min           Max          Avg          Sdev
-1.69821e-05  2.10922e-05  1.52149e-06  5.7707e-06
-1.60263e-05  2.01478e-05  1.49323e-06  5.41754e-06
-1.62501e-05  1.97764e-05  1.49046e-06  5.2904e-06
-1.64293e-05  1.90508e-05  1.45681e-06  5.22742e-06
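For reference, statistics like these can be computed with a few lines of NumPy. The sketch below also estimates how large one plain SGD step is relative to the weights themselves. The arrays are random stand-ins, not real data: the shape, the value ranges, and the `conv1_w_grad` blob name mentioned in the comment are all assumptions.

```python
import numpy as np

def blob_stats(values):
    """Return (min, max, mean, std) for an array of gradient values."""
    values = np.asarray(values, dtype=np.float64)
    return values.min(), values.max(), values.mean(), values.std()

def relative_update(grad, weights, lr):
    """Ratio of the SGD step size to the weight size, using L2 norms."""
    return lr * np.linalg.norm(grad) / np.linalg.norm(weights)

# Stand-in data. In a real run these arrays would be fetched from the
# workspace, e.g. FetchBlob('conv1_w') and FetchBlob('conv1_w_grad')
# (the gradient blob name is an assumption).
rng = np.random.RandomState(0)
weights = rng.uniform(-0.3, 0.3, size=(64, 3, 3, 3))    # hypothetical shape
grad = rng.normal(1.5e-6, 5.5e-6, size=weights.shape)   # scale taken from the table

print("%g %g %g %g" % blob_stats(grad))
print(relative_update(grad, weights, lr=0.01))
```

With gradients of this size and a learning rate of magnitude 0.01, each step changes the weights by a tiny fraction of their norm. A commonly quoted rule of thumb puts a healthy update-to-weight ratio somewhere around 1e-3, so a ratio several orders of magnitude below that would be consistent with the very slow learning described here.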

Here are the core parts of my code:

#init_path is path to init_net protobuf 
#pred_path is path to pred_net protobuf
def main(init_path, pred_path):
    ws.ResetWorkspace()
    data_folder = '/home/myhome/food101/'
    #some debug code here
    arg_scope = {"order":"NCHW"}
    train_model = model_helper.ModelHelper(name="food101_train", arg_scope=arg_scope)
    if not debug:
            data, label = AddInput(
                    train_model, batch_size=128,
                    db=os.path.join(data_folder, 'food101-train-nchw-leveldb'),
                    db_type='leveldb')
    init_net_def, pred_net_def = update_squeeze_net(init_path, pred_path)
    #print str(init_net_def)
    train_model.param_init_net.AppendNet(core.Net(init_net_def))
    train_model.net.AppendNet(core.Net(pred_net_def))
    ws.RunNetOnce(train_model.param_init_net)
    add_params(train_model, init_net_def)
    AddTrainingOperators(train_model, 'softmaxout', 'label')
    AddBookkeepingOperators(train_model)

    ws.RunNetOnce(train_model.param_init_net)
    if debug:
            ws.FeedBlob('data', data)
            ws.FeedBlob('label', label)
    ws.CreateNet(train_model.net)

    total_iters = 10000
    accuracy = np.zeros(total_iters)
    loss = np.zeros(total_iters)
    # Now, we will manually run the network for total_iters iterations.
    for i in range(total_iters):
            #try:
            conv1_w = ws.FetchBlob('conv1_w')
            print conv1_w[0][0]
            ws.RunNet("food101_train")
            #except RuntimeError:
            #       print ws.FetchBlob('conv1').shape
            #       print ws.FetchBlob('pool1').shape
            #       print ws.FetchBlob('fire2/squeeze1x1_w').shape
            #       print ws.FetchBlob('fire2/squeeze1x1_b').shape
            #softmax = ws.FetchBlob('softmaxout')
            #print softmax[i]
            #print softmax[i][0][0]
            #print softmax[i][0][:5]
            #print softmax[64*i]
            accuracy[i] = ws.FetchBlob('accuracy')
            loss[i] = ws.FetchBlob('loss')
            print accuracy[i], loss[i]

My add_params function initializes the weights as follows

# ops lets me initialize the weights of specific ops only, because I initially planned to train just the last layer
def add_params(model, init_net_def, ops=[]):
    def add_param(op):
            for output in op.output:
                    if "_w" in output:
                            weight_shape = []
                            for arg in op.arg:
                                    if arg.name == 'shape':
                                            weight_shape = arg.ints
                            weight_initializer = initializers.update_initializer(
                                                    None,
                                                    None,
                                                    ("XavierFill", {}))
                            model.create_param(
                                    param_name=output,
                                    shape=weight_shape,
                                    initializer=weight_initializer,
                                    tags=ParameterTags.WEIGHT)
                    elif "_b" in output:
                            weight_shape = []
                            for arg in op.arg:
                                    if arg.name == 'shape':
                                            weight_shape = arg.ints
                            weight_initializer = initializers.update_initializer(
                                                    None,
                                                    None,
                                                    ("ConstantFill", {}))
                            model.create_param(
                                    param_name=output,
                                    shape=weight_shape,
                                    initializer=weight_initializer,

I find that my loss fluctuates when I use the full training set, but if I use just one batch and iterate over it several times, the loss goes down, though very slowly.

score 1 · Accepted Answer · answered Jun 12 '17 at 23:23

While SqueezeNet has 50x fewer parameters than AlexNet, it is still a very large network. The original paper does not mention a training time, but the SqueezeNet-based SQ required 22 hours to train using two Titan X graphics cards - and that was with some of the weights pre-trained! I haven't gone over your code in detail, but what you describe is expected behavior - your network is able to learn on the single batch, just not as quickly as you expected.

I suggest reusing as many of the weights as possible instead of reinitializing them, just as the creators of SQ did. This is known as transfer learning, and it works because many of the lower-level features (lines, curves, basic shapes) in an image are the same regardless of the image's content, and reusing the weights for these layers saves the network from having to re-learn them from scratch.
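One way to set this up is to decide by blob name which parameters get trained and which keep their pre-trained values. The following is only a sketch in plain Python: `split_params` is a hypothetical helper, and the blob names (in particular `conv10` as the final classification layer) are assumptions that should be checked against the imported model.

```python
def split_params(param_names, trainable_prefixes):
    """Partition parameter names into (trainable, frozen) lists by name prefix."""
    trainable, frozen = [], []
    for name in param_names:
        if any(name.startswith(prefix) for prefix in trainable_prefixes):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen

# Hypothetical blob names; 'conv10' stands in for whatever the final
# classification layer is called in the imported model.
names = ['conv1_w', 'conv1_b',
         'fire2/squeeze1x1_w', 'fire2/squeeze1x1_b',
         'conv10_w', 'conv10_b']

trainable, frozen = split_params(names, trainable_prefixes=('conv10',))
print(trainable)   # parameters to reinitialize and train
print(frozen)      # parameters to load from init_net and leave untouched
```

Only the names in `trainable` would then be reinitialized and registered for gradient updates; everything in `frozen` keeps the weights loaded from the Model Zoo.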

Thanks for the answer, Jeff. The problem turned out to be the way I was doing gradient descent. I had copied the tutorial's WeightedSum-based update (plain SGD), but after reading some more, it seems that larger networks need a more sophisticated optimizer. Switching to Adam made all the difference, and your last-layer training suggestion really helped speed up training. Thanks! - Shaun, Jun 17 '17 at 18:18
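A small NumPy sketch can illustrate why the optimizer matters when gradients are this small. It compares a single plain SGD step with a single Adam step for a gradient of about 1.5e-6, the order of magnitude reported in the question. The `Adam` class is a minimal textbook version written for this example, not Caffe2's implementation, and the learning rates are arbitrary choices.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Plain gradient descent: the step is proportional to the gradient."""
    return w - lr * grad

class Adam(object):
    """Minimal Adam update for a single parameter array."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = None   # running mean of gradients
        self.v = None   # running mean of squared gradients
        self.t = 0      # number of steps taken

    def step(self, w, grad):
        if self.m is None:
            self.m = np.zeros_like(w)
            self.v = np.zeros_like(w)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        # Bias correction for the zero-initialized running means.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

w = np.zeros(4)
grad = np.full(4, 1.5e-6)   # same order of magnitude as in the question

sgd_delta = np.abs(sgd_step(w, grad, lr=0.01) - w).max()
adam_delta = np.abs(Adam(lr=1e-3).step(w, grad) - w).max()
print(sgd_delta)    # on the order of 1e-8
print(adam_delta)   # close to the Adam learning rate, 1e-3
```

Because Adam divides by a running estimate of the gradient magnitude, its step size stays near the learning rate even when the raw gradients are tiny, whereas the SGD step shrinks along with them.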
