First up, I suppose you mean the gradient of `Output` with respect to `Input`.
Now, the result of both of these calls:

```python
dO = tf.gradients(Output, Input)
dO_i = tf.gradients(Output[i], Input)
```

(for any valid `i`) will be a list with a single element: a tensor with the same shape as `Input`, namely a `[num_timesteps, features]` matrix. Also, the sum of all matrices `dO_i` (over all valid `i`) is exactly the matrix `dO`.
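
If it helps, here is a minimal sketch that checks this identity numerically. The placeholder shapes and the `tanh`/`matmul` toy model are my own assumptions, and it assumes TF 1.x graph mode:

```python
import numpy as np
import tensorflow as tf  # TF 1.x; in TF 2.x use tf.compat.v1 and disable eager

num_timesteps, features = 4, 3  # made-up sizes for the sketch
Input = tf.placeholder(tf.float32, [num_timesteps, features])
# Toy model, purely for illustration: each output row depends on its input row.
Output = tf.tanh(tf.matmul(Input, tf.ones([features, features])))

dO = tf.gradients(Output, Input)[0]  # single [num_timesteps, features] tensor
dO_is = [tf.gradients(Output[i], Input)[0] for i in range(num_timesteps)]

with tf.Session() as sess:
    x = np.random.rand(num_timesteps, features)
    total, parts = sess.run([dO, dO_is], feed_dict={Input: x})
    print(np.allclose(total, np.sum(parts, axis=0)))  # True: sum of dO_i == dO
```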
With this in mind, back to your question. In many cases, individual rows of the `Input` are independent, meaning that `Output[i]` is calculated only from `Input[i]` and does not depend on the other inputs (a typical case: batch processing without batchnorm). If that is your case, then `dO` is going to give you all the individual components `dO_i` at once.
This is because each `dO_i` matrix is going to look like this:

```
[[ 0.   0.   0. ]
 [ 0.   0.   0. ]
 ...
 [ 0.   0.   0. ]
 [ xxx  xxx  xxx]   <- i-th row
 [ 0.   0.   0. ]
 ...
 [ 0.   0.   0. ]]
```
All rows are going to be `0`, except for the `i`-th one. So just by computing the single matrix `dO`, you can easily read off every `dO_i`. This is very efficient.
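
Concretely (reusing the hypothetical `Input` placeholder from the sketch above, with an element-wise `Output` so the rows really are independent), each `dO_i` can be read straight out of `dO`:

```python
# Row-wise independent toy model (an assumption for illustration):
# Output[i] is computed only from Input[i].
Output = tf.tanh(Input)
dO = tf.gradients(Output, Input)[0]  # shape [num_timesteps, features]

# Each dO_i is zero everywhere except its i-th row, so that row survives
# the summation untouched: dO[i] is exactly the nonzero row of dO_i.
dO_2_row = dO[2]  # the only nonzero row of dO_2, shape [features]
```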
However, if that's not your case and all `Output[i]` depend on all inputs, there's no way to extract the individual `dO_i` just from their sum. You have no choice but to calculate each gradient separately: just iterate over `i` and execute `tf.gradients`.
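
A sketch of that fallback, again with the toy names from above (building all the gradient ops once and running them together, so the graph is not rebuilt on every step):

```python
# One tf.gradients call per output row; each result is a full
# [num_timesteps, features] matrix and is generally dense in this case.
dO_is = [tf.gradients(Output[i], Input)[0] for i in range(num_timesteps)]

with tf.Session() as sess:
    x = np.random.rand(num_timesteps, features)
    grads = sess.run(dO_is, feed_dict={Input: x})
```

(If you are on TF 2.x, `tf.GradientTape.jacobian` computes the full Jacobian in one call, which covers the same information without a Python loop.)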