
The following code computes the eigenvalue decomposition of a real symmetric matrix. Then, the gradient of the first eigenvalue with respect to the matrix is computed. This is done three times: 1) with the analytic formula, 2) with PyTorch, 3) with TensorFlow. This yields three different results. Can someone explain this behavior to me?

import numpy as np
import torch
import tensorflow as tf


np.set_printoptions(precision=3)
np.random.seed(123)

# random matrix
matrix_np = np.random.randn(4, 4)
# make symmetric
matrix_np = matrix_np + matrix_np.T
matrix_torch = torch.autograd.Variable(torch.from_numpy(matrix_np), requires_grad=True)
matrix_tf = tf.constant(matrix_np, dtype=tf.float64)

#
# compute eigenvalue decompositions
#
# NumPy
eigvals_np, eigvecs_np = np.linalg.eigh(matrix_np)
# PyTorch
eigvals_torch, eigvecs_torch = torch.symeig(matrix_torch, eigenvectors=True, upper=True)
# TensorFlow
eigvals_tf, eigvecs_tf = tf.linalg.eigh(matrix_tf)

# make sure all three versions computed the same eigenvalues
if not np.allclose(eigvals_np, eigvals_torch.data.numpy()):
    print('NumPy and PyTorch have different eigenvalues')
if not np.allclose(eigvals_np, tf.keras.backend.eval(eigvals_tf)):
    print('NumPy and TensorFlow have different eigenvalues')

#
# compute derivative of first eigenvalue with respect to the matrix
#
# analytic gradient, see "On differentiating eigenvalues and eigenvectors" by Jan R. Magnus
grad_analytic = np.outer(eigvecs_np[:, 0], eigvecs_np[:, 0])
# PyTorch gradient
eigvals_torch[0].backward()
grad_torch = matrix_torch.grad.numpy()
# TensorFlow gradient
grad_tf = tf.gradients(eigvals_tf[0], matrix_tf)[0]
grad_tf = tf.keras.backend.eval(grad_tf)

#
# print all derivatives
#
print('-'*6, 'analytic gradient', '-'*6)
print(grad_analytic)
print('-'*6, 'Pytorch gradient', '-'*6)
print(grad_torch)
print('-'*6, 'TensorFlow gradient', '-'*6)
print(grad_tf)

Prints

------ analytic gradient ------
[[ 0.312 -0.204 -0.398 -0.12 ]
 [-0.204  0.133  0.26   0.079]
 [-0.398  0.26   0.509  0.154]
 [-0.12   0.079  0.154  0.046]]
------ Pytorch gradient ------
[[ 0.312 -0.407 -0.797 -0.241]
 [ 0.     0.133  0.52   0.157]
 [ 0.     0.     0.509  0.308]
 [ 0.     0.     0.     0.046]]
------ TensorFlow gradient ------
[[ 0.312  0.     0.     0.   ]
 [-0.407  0.133  0.     0.   ]
 [-0.797  0.52   0.509  0.   ]
 [-0.241  0.157  0.308  0.046]]

The main diagonals of the three results are identical. The off-diagonal elements of the PyTorch and TensorFlow gradients are either twice the corresponding analytic entries or zero: PyTorch keeps only the upper triangle, TensorFlow only the lower.
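For what it's worth, averaging each framework's gradient with its own transpose reproduces the analytic gradient, at least for this matrix (a quick check appended to the script above):

# quick check: symmetrizing the framework gradients recovers the analytic one
print(np.allclose(grad_analytic, 0.5 * (grad_torch + grad_torch.T)))  # True
print(np.allclose(grad_analytic, 0.5 * (grad_tf + grad_tf.T)))        # True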

Is this intended behavior? Why is it not documented? Are the gradients wrong?

Version infos: TensorFlow 1.14.0, PyTorch 1.0.1

Maryks
    FYI: I ran your code using PyTorch 1.3 and `matrix_torch.grad.numpy()` is the same as the analytic gradient for me. – hdkrgr Nov 14 '19 at 12:16
  • Thank you! I found this issue https://github.com/pytorch/pytorch/pull/23018 Apparently, the lower triangular matrix was indeed a bug and PyTorch fixed it. So it's probably a bug in TensorFlow as well. – Maryks Nov 14 '19 at 12:49

1 Answer


The gradient with respect to a matrix that's guaranteed to be symmetric isn't really well-defined off the diagonal, since a valid implementation is free to read either an off-diagonal element or its mirror across the diagonal (or any weighted combination of the two).

For example, a valid implementation of a function that sums the elements of a 2x2 symmetric matrix x would be

f(x) = x[0][0]+x[0][1]+x[1][0]+x[1][1]

but another valid implementation would be

f(x) = x[0][0]+x[1][1]+2*x[0][1]

If the symmetric matrix is part of a larger calculation that guarantees the matrix is always symmetric (e.g. x = [[a, b], [b, c]], where a, b and c are some scalars), then the gradient of the larger calculation is unaffected by exactly how you define the gradient of the function-of-symmetric-matrix (in the example I'm running with here, we'd have df/da = df/dc = 1 and df/db = 2 no matter how you define f).
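Here is a small PyTorch sketch of that point (the names f1 and f2 are just for this illustration): the two implementations agree on every symmetric input, their gradients with respect to the raw matrix entries differ, but their gradients with respect to a, b and c are identical.

import torch

def f1(x):  # reads both off-diagonal entries
    return x[0, 0] + x[0, 1] + x[1, 0] + x[1, 1]

def f2(x):  # reads only one off-diagonal entry, counted twice
    return x[0, 0] + x[1, 1] + 2 * x[0, 1]

# gradients with respect to the raw matrix entries: they differ
x = torch.tensor([[1.0, 2.0], [2.0, 3.0]], requires_grad=True)
print(torch.autograd.grad(f1(x), x)[0])  # [[1, 1], [1, 1]]
print(torch.autograd.grad(f2(x), x)[0])  # [[1, 2], [0, 1]]

# gradients through a symmetric parameterization x = [[a, b], [b, c]]: identical
a, b, c = (torch.tensor(v, requires_grad=True) for v in (1.0, 2.0, 3.0))
x_sym = torch.stack([torch.stack([a, b]), torch.stack([b, c])])
for f in (f1, f2):
    grads = torch.autograd.grad(f(x_sym), (a, b, c), retain_graph=True)
    print([g.item() for g in grads])  # [1.0, 2.0, 1.0] both times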

That said, the symmetric gradient is a nice choice (as explained in the PyTorch PR linked in the comment) because it means that if you happen to be doing gradient descent updates on a symmetric matrix, the matrix stays symmetric.
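A quick sketch of that last point, reusing matrix_np, grad_analytic and grad_tf from the question's script: one gradient-descent step with the symmetric gradient keeps the matrix symmetric, while a step with the triangular gradient does not.

# one gradient-descent step on the symmetric matrix from the question
lr = 0.1
step_sym = matrix_np - lr * grad_analytic  # symmetric gradient
step_tri = matrix_np - lr * grad_tf        # lower-triangular gradient
print(np.allclose(step_sym, step_sym.T))   # True: still symmetric
print(np.allclose(step_tri, step_tri.T))   # False: symmetry is lost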

Also, note that TensorFlow does document that only the lower triangular part of the input is used for the calculation. Since the implementation effectively mirrors the lower triangle into the upper one, each strictly-lower entry contributes twice, which is why those gradient entries come out doubled while the upper triangle is reported as zero.

  • Thank you! Someone made me aware of this paper: https://arxiv.org/pdf/1911.06491.pdf The paper discusses that there exist two notions of the gradient of a scalar function of a symmetric matrix - but one of them is wrong, as the authors demonstrate and prove. According to the authors, the wrong definition is mainly used by statisticians and electrical engineers. In particular, as I noticed, the formula which one can find in the famous matrix cookbook is not correct. – Maryks Sep 03 '20 at 09:22