As I understand it, an Nvidia Tensor Core multiplies two 4x4 matrices and adds the result to a third 4x4 matrix. Multiplying two 4x4 matrices produces a 4x4 matrix, and adding two 4x4 matrices also produces a 4x4 matrix. Still, Nvidia says "Each Tensor Core provides a 4x4x4 matrix processing array".
Four multiply-accumulate operations are needed for each row-times-column element of the result. I thought maybe the last x4 comes from the intermediate results before the accumulation, but that doesn't quite fit the description on Nvidia's pages:
"The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply, as Figure 9 shows." https://developer.nvidia.com/blog/cuda-9-features-revealed/
A 4x4x4 matrix multiply? I thought matrices were two-dimensional by definition.
Can someone please explain where the last x4 comes from?