
I'm working with regression to predict an array of 0-1 values (an array of bits). The neural network specification is the following (MATLAB):

layers = [
    imageInputLayer([1 16 2],'Normalization','none')
    fullyConnectedLayer(512)
    batchNormalizationLayer
    reluLayer

    fullyConnectedLayer(64)
    batchNormalizationLayer
    reluLayer

%     sigmoidLayer
    tanhLayer
    regressionLayer
];

I've used the following code to implement the sigmoid layer:

classdef sigmoidLayer < nnet.layer.Layer
    methods
        function layer = sigmoidLayer(name)
            % Set layer name if one was supplied
            if nargin == 1
                layer.Name = name;
            end
            % Set layer description
            layer.Description = 'sigmoidLayer';
        end
        function Z = predict(layer, X)
            % Forward input data through the layer: Z = sigmoid(X)
            Z = exp(X)./(exp(X) + 1);
        end
        function dLdX = backward(layer, X, Z, dLdZ, memory)
            % Backward propagate the derivative of the loss through
            % the layer, using d(sigmoid)/dX = Z.*(1-Z)
            dLdX = Z.*(1-Z) .* dLdZ;
        end
    end
end

The target outputs are only 0 or 1. So why does sigmoid perform worse than tanh, instead of equal or better?

1 Answer


It depends on what you call "worse". Without more details it's hard to answer clearly.

However, one of the key differences between the two is the derivative of the activation function. Since the magnitude of the gradient update depends on that derivative, the update can become close to 0 (and the network effectively stops learning) when the activation saturates.

The sigmoid saturates at 0 and 1: as x -> +/- inf, sigmoid(x) -> 0 or 1 and d(sigmoid)/dx -> 0, so depending on your data this can cause slower or "worse" learning. Tanh also saturates as it approaches +/- 1, but it does not saturate around 0; in fact its derivative reaches its maximum there, and that maximum is 1 versus 0.25 for the sigmoid, so learning in that region is not a problem.
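As an illustration (a minimal sketch, not part of your network), you can compare the two derivatives directly and see how much smaller the sigmoid's gradient is everywhere, and how quickly both vanish once the pre-activation saturates:

% Compare the derivatives of sigmoid and tanh on a grid of pre-activations.
x = linspace(-6, 6, 13);

sig      = 1 ./ (1 + exp(-x));      % sigmoid(x)
dsig_dx  = sig .* (1 - sig);        % peaks at 0.25 when x = 0
dtanh_dx = 1 - tanh(x).^2;          % peaks at 1.00 when x = 0

% For |x| > ~4 both gradients are close to zero (saturation),
% but tanh's gradient is up to 4x larger near x = 0.
disp(table(x.', dsig_dx.', dtanh_dx.', ...
    'VariableNames', {'x', 'dSigmoid', 'dTanh'}))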

You might also want to look into label smoothing.
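For binary targets, label smoothing simply pulls the hard 0/1 values slightly toward the middle, so the activation is never pushed into its saturated region. A hypothetical sketch (epsilon and Y are illustrative, not from your setup):

% Label smoothing for 0/1 regression targets.
epsilon = 0.1;
Y       = [0 1 1 0 1];                    % original hard targets
Ysmooth = Y * (1 - epsilon) + epsilon/2;  % 0 -> 0.05, 1 -> 0.95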
