Your code looks like it is using a sparse operation (SparseLengthsSum), which may lead to issues with the gradient calculation, as some of the gradients will be undefined.
(This follows up on your previous question, "How to ensure that a tensor is in dense representation in caffe2", which I answered earlier.)
That sparse operation gathers slices of a data tensor along its first dimension according to an indices tensor, groups the gathered slices according to a lengths tensor, and then reduces each group by summing it.
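For concreteness, here is a minimal NumPy sketch of that forward semantics (the example values are made up; in practice the operator runs inside a Caffe2 net rather than in NumPy):

```python
import numpy as np

def sparse_lengths_sum(data, indices, lengths):
    """NumPy sketch of SparseLengthsSum's forward semantics:
    gather rows of `data` by `indices`, then sum the gathered rows
    in groups whose sizes are given by `lengths`."""
    gathered = data[indices]                                  # gather along the first dimension
    out = np.zeros((len(lengths),) + data.shape[1:], dtype=data.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        out[i] = gathered[offset:offset + n].sum(axis=0)      # reduce each group by summing
        offset += n
    return out

# Made-up example values
data = np.arange(12, dtype=np.float32).reshape(4, 3)          # 4 rows of size 3
indices = np.array([0, 2, 2, 3])                              # which rows to gather
lengths = np.array([2, 2])                                    # two groups of 2 gathered rows
print(sparse_lengths_sum(data, indices, lengths))
# [[ 6.  8. 10.]   row0 + row2
#  [15. 17. 19.]]  row2 + row3
```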
The issue arises when you try to backpropagate through this operation. Because SparseLengthsSum only touches the rows of the data tensor selected by the indices, its gradient with respect to that tensor is only defined for those rows. Caffe2 therefore propagates the gradient in a sparse form (a set of row indices plus the corresponding gradient slices) rather than as a full dense tensor, and every downstream operator has to be able to handle that representation.
With "Gradient of output .../Transpose is sparse (expected dense)
", it appears the backward pass is expecting a dense gradient (a gradient for every element of the input tensor), but the SparseLengthsSum
operation is only providing a sparse gradient (a gradient for only some elements of the input tensor). That discrepancy between expected and provided gradients is likely causing the error.
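To make that concrete, here is a NumPy sketch (again, not Caffe2 code) of what the backward pass of such a gather-and-segment-sum computes. Only the rows that appear in the indices receive any gradient, which is why the gradient is naturally represented in a sparse (indices plus values) form:

```python
import numpy as np

def sparse_lengths_sum_grad(grad_out, indices, lengths, data_shape):
    """Sketch of the backward pass: each gathered row receives the upstream
    gradient of the group it was summed into; rows of the data tensor that
    were never gathered receive no gradient at all."""
    dense_grad = np.zeros(data_shape, dtype=grad_out.dtype)
    offset = 0
    for i, n in enumerate(lengths):
        for idx in indices[offset:offset + n]:
            dense_grad[idx] += grad_out[i]                    # scatter-add into the gathered rows
        offset += n
    return dense_grad

# Made-up values matching the forward sketch above
grad_out = np.ones((2, 3), dtype=np.float32)
print(sparse_lengths_sum_grad(grad_out, [0, 2, 2, 3], [2, 2], (4, 3)))
# Row 1 stays all zeros: it was never gathered, so a sparse
# (indices, values) representation of this gradient omits it entirely.
```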
A workaround would be to use the ReduceSum operator in a loop: iterate over your tensor, use ReduceSum to compute the sum of the elements up to each index, and write each partial sum into the output tensor with ScatterAssign (a sketch follows below).
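Here is a minimal NumPy sketch of that loop, just to show the logic; in an actual Caffe2 net each step would be expressed with operators such as Slice, ReduceSum, and ScatterAssign rather than NumPy calls:

```python
import numpy as np

def cumsum_via_loop(x):
    """Sketch of the workaround's logic: for each index i, sum the prefix
    x[0 : i + 1] (what ReduceSum would do on a slice of the tensor) and
    write the result into position i of the output (what ScatterAssign
    would do)."""
    out = np.zeros_like(x)
    for i in range(len(x)):
        out[i] = x[: i + 1].sum()     # prefix sum for index i
    return out

x = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
print(cumsum_via_loop(x))             # [ 1.  3.  6. 10.]
print(np.cumsum(x))                   # reference result
```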
If you need a more efficient and flexible solution, and you are not strictly tied to Caffe2, I would recommend switching to a more modern deep learning framework such as PyTorch, TensorFlow, or JAX, all of which support a cumsum operation natively and have robust support for autograd.
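For example, PyTorch's torch.cumsum gives you both the forward result and the gradients in one line:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0], requires_grad=True)
y = torch.cumsum(x, dim=0)     # tensor([ 1.,  3.,  6., 10.])
y.sum().backward()             # autograd handles the backward pass
print(x.grad)                  # tensor([4., 3., 2., 1.])
```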
That workaround can be helpful because it allows you to replicate the cumulative sum behavior, even though Caffe2 does not provide an out-of-the-box operator for that purpose.
In essence, each element of a cumulative sum is just the sum of all elements of the sequence up to that index. So, by iterating over your tensor's elements and accumulating the sum, you are essentially performing the cumulative sum operation.
Specifically, the ReduceSum operator returns the sum of all the elements of a tensor. You would typically apply it to a slice of the tensor to get the sum of all elements up to the current index, and then write that partial sum into the corresponding position of the output tensor (e.g., with ScatterAssign).
As mentioned before, this workaround can be quite inefficient, especially for large tensors. It also does not make gradient computation any easier, since it involves manually manipulating tensor elements within a loop.