I am currently trying to implement the network presented in [1] using Keras. The model can be divided into two parts: the first one, called the super pixel depth network, regresses depth values for an image segmented into super pixels, while the second one is a continuous conditional random field (CRF) layer whose goal is to ensure spatial and temporal consistency in the depth maps predicted for each frame of a video. Each part has its own set of parameters: W for the super pixel network, and alpha_s & alpha_t for the CRF. Both are trained jointly using stochastic gradient descent.
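For reference, here is roughly how I put the first part together (a simplified sketch: the second mask input, the pooling-by-averaging, and all names and shapes are mine, not the paper's):

```python
import tensorflow as tf
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Lambda, Dense
from keras.models import Model

def superpixel_pool(args):
    """Average the conv features falling inside each super pixel mask."""
    feats, masks = args                               # (b, h, w, c), (b, n, h, w)
    flat_feats = tf.reshape(feats, [tf.shape(feats)[0], -1, tf.shape(feats)[3]])
    flat_masks = tf.reshape(masks, [tf.shape(masks)[0], tf.shape(masks)[1], -1])
    summed = tf.matmul(flat_masks, flat_feats)        # (b, n, c)
    area = tf.expand_dims(tf.reduce_sum(flat_masks, axis=2), -1)
    return summed / (area + 1e-8)

frame = Input(shape=(224, 224, 3))
masks = Input(shape=(None, 7, 7))   # one binary mask per super pixel,
                                    # resized to the conv feature resolution
trunk = VGG16(include_top=False, weights='imagenet', input_tensor=frame)
pooled = Lambda(superpixel_pool)([trunk.output, masks])
z = Dense(1)(pooled)                # one regressed depth per super pixel
unary = Model(inputs=[frame, masks], outputs=z)
```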
I had no problem implementing the super pixel depth network: it is just the first 31 layers of VGG-16 followed by a super pixel pooling layer, as in the sketch above. The tricky part for me is the CRF layer and the following loss function:

-log P(d | I) = d^T L d - 2 z^T d + z^T L^-1 z - (1/2) log|L| + (N/2) log(pi)
In this equation:
- I is a frame from a video.
- d is an Nx1 vector of ground-truth super pixel depths, where N is the number of super pixels in the video.
- L is an NxN matrix depending on alpha_s, alpha_t, and the spatio-temporal relations between super pixels given by the segmentation algorithm.
- z is an Nx1 vector of estimated super pixel depths depending on W.
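Expanding this expression, I believe the loss can be computed directly with TensorFlow ops as below. This is only a sketch under my assumptions: d and z are (N, 1) tensors, L is already assembled from alpha_s, alpha_t and the super pixel graph, and I factor log|L| with a Cholesky decomposition since L should be positive definite:

```python
import numpy as np
import tensorflow as tf

def neg_log_likelihood(d, z, L):
    # d, z: (N, 1) ground-truth / predicted super pixel depths; L: (N, N).
    n = tf.cast(tf.shape(L)[0], tf.float32)
    L_inv = tf.matrix_inverse(L)                                  # L^-1
    quad_d = tf.matmul(d, tf.matmul(L, d), transpose_a=True)     # d^T L d
    cross = -2.0 * tf.matmul(z, d, transpose_a=True)             # -2 z^T d
    quad_z = tf.matmul(z, tf.matmul(L_inv, z), transpose_a=True) # z^T L^-1 z
    # log|L| via Cholesky, assuming L is positive definite.
    log_det = 2.0 * tf.reduce_sum(tf.log(tf.matrix_diag_part(tf.cholesky(L))))
    # The (N/2) log(pi) term is constant in the parameters anyway.
    return tf.squeeze(quad_d + cross + quad_z) - 0.5 * log_det + 0.5 * n * np.log(np.pi)
```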
If I am not mistaken, this loss function does not have the conventional form "loss(y_true, y_pred)": it uses y_true (d in the expression) but also the network weights (through z and L). My idea is to use the functional API and write a custom loss function that computes -log P(d | I), but I am wondering how the gradients will be computed. I read here that Keras takes care of gradient computation as long as we only use existing operations, but matrix inversion is not among the Keras backend operations. Will it work if I call tf.matrix_inverse directly?
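Concretely, here is a sketch of the kind of custom layer I have in mind, assuming the answer to the gradient question is positive. It owns alpha_s and alpha_t as trainable weights, assembles L from spatial and temporal affinity matrices (the names S and T, and the L = I + D - R structure, are my reading of the paper, not code from it), reuses the neg_log_likelihood helper from the previous snippet, and attaches the loss with add_loss:

```python
import tensorflow as tf
from keras.layers import Layer

class CRFLoss(Layer):
    """Trainable alpha_s / alpha_t, with the CRF loss attached via add_loss."""
    def build(self, input_shape):
        self.alpha_s = self.add_weight(name='alpha_s', shape=(1,),
                                       initializer='ones', trainable=True)
        self.alpha_t = self.add_weight(name='alpha_t', shape=(1,),
                                       initializer='ones', trainable=True)
        super(CRFLoss, self).build(input_shape)

    def call(self, inputs):
        z, d, S, T = inputs
        # Assume batch size 1 and drop the batch dimension for the algebra:
        # z0, d0: (N, 1); S0, T0: (N, N).
        z0, d0, S0, T0 = z[0], d[0], S[0], T[0]
        R = self.alpha_s * S0 + self.alpha_t * T0      # combined affinities
        D = tf.diag(tf.reduce_sum(R, axis=1))          # degree matrix
        L = tf.eye(tf.shape(R)[0]) + D - R             # my reading of L
        self.add_loss(neg_log_likelihood(d0, z0, L))   # helper defined above
        return z                                       # identity pass-through

    def compute_output_shape(self, input_shape):
        return input_shape[0]
```

If something like this is valid, I suppose the full model could then be compiled with loss=None so that the CRF term is the only training signal, but that still hinges on the gradient question above.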
Best regards, Ambroise
[1] Zhao, Xu-Ran, Xun Wang, and Qi-Chao Chen. "Temporally Consistent Depth Map Prediction Using Deep Convolutional Neural Network and Spatial-Temporal Conditional Random Field." Journal of Computer Science and Technology 32.3 (2017): 443-456.