
As I understand it, an Nvidia Tensor Core multiplies two 4x4 matrices and adds the result to a third matrix. Multiplying two 4x4 matrices produces a 4x4 matrix, and adding two 4x4 matrices produces a 4x4 matrix. Still, Nvidia says that "Each Tensor Core provides a 4x4x4 matrix processing array".

Four multiply-accumulate operations are needed for each row-times-column element of the result. I thought the last x4 might come from the intermediate results before accumulation, but that doesn't quite fit the description on Nvidia's pages.
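To make the counting concrete, this is roughly what I mean, written as a plain loop (just a sketch, names are mine, not actual Tensor Core code):

```
// Sketch of D = A*B + C for 4x4 matrices (plain C++, not actual Tensor Core code).
// Each of the 16 output elements is a dot product of length 4 accumulated onto C,
// i.e. 4 multiply-accumulate operations per element.
void matmul_4x4_acc(const float A[4][4], const float B[4][4],
                    const float C[4][4], float D[4][4]) {
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float acc = C[row][col];
            for (int k = 0; k < 4; ++k) {
                acc += A[row][k] * B[k][col];  // one multiply-accumulate per k
            }
            D[row][col] = acc;
        }
    }
}
```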

"The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply, as Figure 9 shows." https://developer.nvidia.com/blog/cuda-9-features-revealed/

A 4x4x4 matrix multiply? I thought matrices were two-dimensional by definition.

Can someone please explain where the last x4 comes from?

Alfred

2 Answers


4x4x4 is just the notation for multiplication of one 4x4 matrix with another 4x4 matrix.

If you were to multiply a 4x8 matrix with an 8x4 matrix, you would have 4x8x4. So if A is NxK and B is KxM, the product can be referred to as an NxKxM matrix multiply.

I just briefly looked it up and found this paper, which uses exactly this notation (e.g. in Section 4.6 on page 36): https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/153863/eth-6705-01.pdf
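As a side note, the warp-level WMMA API that CUDA exposes for Tensor Cores uses the same three-number convention for its tile shape, just written in the order M, N, K. A rough sketch for the 16x16x16 FP16 shape (this is the API level described in the blog post linked in the question, not the 4x4x4 hardware operation itself; the kernel name is mine):

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Rough sketch: one warp computes D = A*B + C for a single 16x16x16 tile.
// The three 16s in the fragment types are the M, N, K dimensions of the tile.
// Requires a Tensor Core capable GPU (compute capability 7.0 or newer).
__global__ void wmma_16x16x16(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                       // A is MxK
    wmma::load_matrix_sync(b_frag, b, 16);                       // B is KxN
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);  // C is MxN
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);              // C += A*B
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major); // write D
}
```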

M. Steiner
  • Thanks! I don't have enough reputation to upvote, but I'll mark it as the accepted answer. – Alfred Jul 12 '22 at 17:21
  • It probably makes more sense in a hardware pipeline than when doing it by hand on paper, where you normally add things up before moving on to the next element. At least I don't recall seeing the notation before now. – Alfred Jul 12 '22 at 17:23

"The cube itself represents the 64 element-wise products required to generate the full 4x4 product matrix" (cvw.cac.cornell.edu/GPUarch/tensor_cores). It is the intermediate products before accumulation that make up the last x4.
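Spelled out as a plain loop (just a sketch, not how the hardware is implemented; the function name is mine), the cube of intermediate products and the accumulation look like this:

```
// Sketch: the 4*4*4 = 64 element-wise products of a 4x4x4 multiply,
// kept as a cube before any accumulation.
void matmul_4x4_cube(const float A[4][4], const float B[4][4],
                     const float C[4][4], float D[4][4]) {
    float prod[4][4][4];  // prod[i][j][k] = A[i][k] * B[k][j]
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                prod[i][j][k] = A[i][k] * B[k][j];  // 64 independent multiplies

    // Accumulation collapses the k axis: each of the 16 outputs is the sum
    // of 4 intermediate products plus the corresponding element of C.
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];
            for (int k = 0; k < 4; ++k)
                acc += prod[i][j][k];
            D[i][j] = acc;
        }
}
```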

Alfred