The following code computes the eigenvalue decomposition of a real symmetric matrix and then the gradient of the first eigenvalue with respect to that matrix, in three ways: 1) using the analytic formula, 2) using TensorFlow, 3) using PyTorch. The three results differ. Can someone explain this behavior to me?
import numpy as np
import torch
import tensorflow as tf
np.set_printoptions(precision=3)
np.random.seed(123)
# random matrix
matrix_np = np.random.randn(4, 4)
# make symmetric
matrix_np = matrix_np + matrix_np.T
# leaf tensor that requires grad (torch.autograd.Variable is deprecated)
matrix_torch = torch.from_numpy(matrix_np).requires_grad_()
matrix_tf = tf.constant(matrix_np, dtype=tf.float64)
#
# compute eigenvalue decompositions
#
# NumPy
eigvals_np, eigvecs_np = np.linalg.eigh(matrix_np)
# PyTorch
eigvals_torch, eigvecs_torch = torch.symeig(matrix_torch, eigenvectors=True, upper=True)
# TensorFlow
eigvals_tf, eigvecs_tf = tf.linalg.eigh(matrix_tf)
# make sure all three versions computed the same eigenvalues
if not np.allclose(eigvals_np, eigvals_torch.detach().numpy()):
    print('NumPy and PyTorch have different eigenvalues')
if not np.allclose(eigvals_np, tf.keras.backend.eval(eigvals_tf)):
    print('NumPy and TensorFlow have different eigenvalues')
#
# compute derivative of first eigenvalue with respect to the matrix
#
# analytic gradient, see "On differentiating eigenvalues and eigenvectors" by Jan R. Magnus
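# for a simple eigenvalue lambda with unit-norm eigenvector v of a symmetric matrix A,
# first-order perturbation theory gives d(lambda) = v^T dA v, i.e. d(lambda)/dA = v v^T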
grad_analytic = np.outer(eigvecs_np[:, 0], eigvecs_np[:, 0])
# PyTorch gradient
eigvals_torch[0].backward()
grad_torch = matrix_torch.grad.numpy()
# TensorFlow gradient
grad_tf = tf.gradients(eigvals_tf[0], matrix_tf)[0]
grad_tf = tf.keras.backend.eval(grad_tf)
#
# print all derivatives
#
print('-'*6, 'analytic gradient', '-'*6)
print(grad_analytic)
print('-'*6, 'PyTorch gradient', '-'*6)
print(grad_torch)
print('-'*6, 'TensorFlow gradient', '-'*6)
print(grad_tf)
This prints:
------ analytic gradient ------
[[ 0.312 -0.204 -0.398 -0.12 ]
 [-0.204  0.133  0.26   0.079]
 [-0.398  0.26   0.509  0.154]
 [-0.12   0.079  0.154  0.046]]
------ PyTorch gradient ------
[[ 0.312 -0.407 -0.797 -0.241]
 [ 0.     0.133  0.52   0.157]
 [ 0.     0.     0.509  0.308]
 [ 0.     0.     0.     0.046]]
------ TensorFlow gradient ------
[[ 0.312  0.     0.     0.   ]
 [-0.407  0.133  0.     0.   ]
 [-0.797  0.52   0.509  0.   ]
 [-0.241  0.157  0.308  0.046]]
The main diagonals of the three results are identical. The off-diagonal elements of the PyTorch and TensorFlow gradients are either twice the corresponding analytic values or zero: PyTorch returns an upper-triangular matrix and TensorFlow a lower-triangular one.
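To make that concrete: symmetrizing either framework's output, i.e. (G + G.T) / 2, appears to reproduce the analytic gradient, at least to the precision printed above. A quick check, reusing the arrays from the snippet:

grad_torch_sym = (grad_torch + grad_torch.T) / 2
grad_tf_sym = (grad_tf + grad_tf.T) / 2
print(np.allclose(grad_torch_sym, grad_analytic))  # True, judging by the values printed above
print(np.allclose(grad_tf_sym, grad_analytic))     # True, judging by the values printed above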
Is this intended behavior? Why is it not documented? Are the gradients wrong?
Version info: TensorFlow 1.14.0, PyTorch 1.0.1
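For reference, a small finite-difference sanity check of the analytic formula (just a sketch; eps and the symmetric perturbation direction S are arbitrary choices I added, not part of the setup above, and it assumes the first eigenvalue is simple, which it is for this random matrix):

# compare the directional derivative of the first eigenvalue along a random
# symmetric direction S with the inner product <grad_analytic, S>
eps = 1e-6
S = np.random.randn(4, 4)
S = S + S.T  # keep the perturbation symmetric
lam_perturbed = np.linalg.eigh(matrix_np + eps * S)[0][0]
print((lam_perturbed - eigvals_np[0]) / eps)  # finite-difference directional derivative
print(np.sum(grad_analytic * S))              # analytic directional derivative; should agree to several digits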