A deeper explanation of ReLU and its variants can be found in the following links:
- https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
- https://medium.com/@himanshuxd/activation-functions-sigmoid-relu-leaky-relu-and-softmax-basics-for-neural-networks-and-deep-8d9c70eed91e
The main drawback of regular ReLU is that the pre-activation input can become negative (due to the operations performed earlier in the network), which leads to what is referred to as the "dying ReLU" problem: the gradient is 0 whenever the unit is not active. This can lead to cases where a unit never activates, since a gradient-based optimization algorithm will not adjust the weights of a unit that never activates initially. Further, as with the vanishing gradients problem, we might expect learning to be slow when training ReLU networks with constant 0 gradients.
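As a quick illustration of the point above, here is a minimal NumPy sketch (not taken from the linked articles, just an assumption of how one might demonstrate it): both the output and the gradient of ReLU are zero for negative pre-activations, so a unit stuck in that region receives no weight updates.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # df/dx = 1 for x > 0, and 0 otherwise
    return (x > 0).astype(float)

pre_activations = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(pre_activations))       # output is 0 for every x <= 0
print(relu_grad(pre_activations))  # gradient is also 0 there -> no learning signal
```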
So Leaky ReLU replaces the zero slope on the negative side with a small slope, say 0.001 (referred to as "alpha"). For Leaky ReLU the function is f(x) = max(0.001x, x). The gradient of 0.001x is 0.001, which is non-zero, so the unit keeps learning instead of hitting a dead end.
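A matching sketch of Leaky ReLU, using the 0.001 slope mentioned above purely for illustration (common library defaults are on the order of 0.01), shows that the negative side now has a small but non-zero gradient:

```python
import numpy as np

def leaky_relu(x, alpha=0.001):
    # f(x) = max(alpha * x, x): equals x for x > 0, alpha * x otherwise
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.001):
    # df/dx = 1 for x > 0, alpha otherwise -- never exactly zero
    return np.where(x > 0, 1.0, alpha)

pre_activations = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(pre_activations))       # negative inputs are scaled by alpha, not zeroed
print(leaky_relu_grad(pre_activations))  # gradient is alpha (not 0) for x <= 0
```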