Does automatic differentiation in TensorFlow compute a subgradient where the function is not differentiable? If several subgradients exist at that point, which one is returned?
I am trying to implement the paper at https://www.aclweb.org/anthology/P13-1045, which uses recursive neural networks for efficient language parsing. The objective function uses a hinge loss to pick the optimal output vectors, so it is not differentiable everywhere. I wrote the model in TensorFlow (v1.12) in eager mode and used automatic differentiation to compute the gradients. The gradient values change after every batch and the accuracy improves slightly at first, but after a while it drops again, and this cycle keeps repeating. The model did not converge for any of the hyper-parameter configurations I tried.
Mini-batch size: 256, 512, 1024; regularization parameter: 0.1, 0.01, 0.001; learning rate: 0.1, 0.01, 0.001; optimizer: gradient descent, Adagrad, Adam.
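To make the search space concrete, here is a minimal sketch of the settings above written out as a grid. It assumes every combination was tried (a full cross product); the variable names and dictionary keys are only illustrative and are not taken from the paper or from TensorFlow.

```python
from itertools import product

# Values listed above; the names are illustrative, not from any library.
batch_sizes = [256, 512, 1024]
reg_params = [0.1, 0.01, 0.001]
learning_rates = [0.1, 0.01, 0.001]
optimizers = ["gradient_descent", "adagrad", "adam"]

# Assumption: the full cross product of all four settings was tried.
grid = [
    {"batch_size": b, "reg": r, "learning_rate": lr, "optimizer": opt}
    for b, r, lr, opt in product(batch_sizes, reg_params, learning_rates, optimizers)
]

print(len(grid))  # 3 * 3 * 3 * 3 = 81 configurations
```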
The paper describes how to find a subgradient of the objective function only in very abstract terms, and I have not understood it yet. At first I assumed that automatic differentiation would compute the subgradient for me. I am now starting to doubt that, because it seems to be the only piece I have not accounted for.
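To show what I mean by a subgradient of a hinge-type objective, here is a minimal pure-Python sketch of a generic structured hinge loss with linear scores. This is my own simplification, not the paper's model: the feature vectors, the margin costs, and the function names `dot` and `structured_hinge` are all hypothetical. At a tie, every maximizing candidate gives a valid subgradient; this sketch simply takes the first one, and I do not know whether TensorFlow makes the same choice.

```python
def dot(w, f):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * fi for wi, fi in zip(w, f))


def structured_hinge(w, candidates, gold_index, margin):
    """Loss = max_i (w . f_i + margin[i]) - w . f_gold.

    candidates: list of feature vectors, one per candidate output.
    margin[i]:  cost of predicting candidate i (0 for the gold one).
    Returns (loss, subgradient w.r.t. w, index of the chosen maximizer).
    """
    augmented = [dot(w, f) + margin[i] for i, f in enumerate(candidates)]
    # max() returns the first maximizer, so ties are broken by position.
    best = max(range(len(candidates)), key=lambda i: augmented[i])
    loss = augmented[best] - dot(w, candidates[gold_index])
    # One valid subgradient: features of the maximizer minus gold features.
    subgrad = [fb - fg for fb, fg in zip(candidates[best], candidates[gold_index])]
    return loss, subgrad, best


candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
margin = [0.0, 1.0, 0.5]

# No tie: candidate 1 is the unique maximizer.
print(structured_hinge([0.0, 0.0], candidates, 0, margin))
# (1.0, [-1.0, 1.0], 1)

# Three-way tie: all augmented scores equal 1.0, the first one is picked.
print(structured_hinge([1.0, 0.0], candidates, 0, margin))
# (0.0, [0.0, 0.0], 0)
```

In the tied case, `[0.0, 0.0]`, `[-1.0, 1.0]` and `[-0.5, 0.5]` are all valid subgradients, which is exactly the ambiguity I am asking about.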