62

I am completely new to HPC-related terminology, but I just saw that AWS EC2 released a new instance type powered by the new Nvidia Tesla V100, which has both kinds of "cores": CUDA cores (5,120) and Tensor cores (640). What is the difference between the two?

Aayush
  • 220
  • 1
  • 11

5 Answers

85

Currently only the Tesla V100 and the Titan V have tensor cores. Both GPUs have 5,120 CUDA cores, where each core can perform up to 1 single-precision multiply-accumulate operation (e.g. in fp32: x += y * z) per GPU clock (the Tesla V100 PCIe boost frequency is 1.38 GHz).

Each tensor core operates on small matrices of size 4x4. A tensor core can perform 1 matrix multiply-accumulate operation per GPU clock: it multiplies two 4x4 fp16 matrices and adds the fp32 product matrix (size: 4x4) to the accumulator (which is also an fp32 4x4 matrix).

It is called mixed precision because the input matrices are fp16, but the multiplication result and the accumulator are fp32 matrices.
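For what it's worth, this fp16-in / fp32-accumulate operation is what CUDA's warp-level WMMA API (available since CUDA 9 on Volta, sm_70+) exposes, at a 16x16x16 tile granularity built on top of the 4x4 hardware operation. A minimal sketch, with the kernel name and layout choices chosen purely for illustration:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp (launch with 32 threads) computes D = A*B + C for a single 16x16 tile.
// A and B are fp16, the accumulator is fp32 -- the "mixed precision" part.
__global__ void wmma_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);              // accumulator starts at 0
    wmma::load_matrix_sync(a_frag, a, 16);            // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);        // acc = A*B + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```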

Arguably the proper name would just be "4x4 matrix cores", but the NVIDIA marketing team decided to use "tensor cores".

Artur
  • 874
  • 6
  • 4
  • 17
    time to update this answer - [Nvidia's Turing architecture](https://nvidianews.nvidia.com/news/nvidia-reinvents-computer-graphics-with-turing-architecture) just got released – Brett Holman Aug 14 '18 at 15:48
26

GPUs have always been good for machine learning. GPU cores were originally designed for physics and graphics computation, which involves matrix operations. General computing tasks do not require lots of matrix operations, so CPUs are much slower at them. Physics and graphics are also far easier to parallelise than general computing tasks, which leads to the high core count.

Due to the matrix-heavy nature of machine learning (neural nets), GPUs were a great fit. Tensor cores are simply more heavily specialised for the types of computation involved in machine learning software (such as TensorFlow).

Nvidia has written a detailed blog post here, which goes into far more detail on how tensor cores work and the performance improvements over CUDA cores.

MikeS159
  • 1,884
  • 3
  • 29
  • 54
14

CUDA cores:

Perform one scalar (single-value) multiply-accumulate per GPU clock:

1 x 1 per GPU clock

TENSOR cores:

Perform one small matrix multiply-accumulate per GPU clock:

[1 1 1 1       [1 1 1 1
 1 1 1 1   x    1 1 1 1    per GPU clock
 1 1 1 1        1 1 1 1
 1 1 1 1]       1 1 1 1]

To be more precise, a tensor core does the work of many CUDA cores at the same time.
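To put the contrast in code: the scalar path below is, roughly, what a single CUDA core executes, one fp32 multiply-accumulate per thread per step, whereas a tensor core performs an entire small matrix multiply-accumulate in one instruction. A rough CUDA sketch, with illustrative names only:

```cuda
// Scalar path: each thread performs one fp32 multiply-accumulate,
// which is what a single CUDA core executes each clock.
__global__ void scalar_fma(const float *x, const float *y, float *acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] += x[i] * y[i];   // one fp32 FMA per thread
}
```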

1

Most deep learning neural network computations are matrix multiplications, so NVIDIA introduced tensor cores to perform these matrix multiplications efficiently. (Matrices and tensors are both multi-dimensional arrays; a matrix is just the two-dimensional case.)

CUDA core - 1 single-precision (fp32) multiply-accumulate per clock.

Tensor core - 64 fp16 multiply-accumulates, with fp32 output, per clock.

The main difference is that CUDA cores don't compromise on precision, while tensor cores, by taking fp16 inputs, compromise a bit on precision. That is why tensor cores are used for mixed-precision training: training still happens in floating point, but the inputs are fp16 and the accumulated outputs are fp32.

NVIDIA claims that, with limited loss of accuracy, they are able to achieve 4x-8x faster training with tensor cores.
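Where a factor like 8x can come from is easy to see from the per-clock numbers above. A back-of-envelope calculation, using the V100 clock and core counts quoted elsewhere in this thread (assumptions, not measurements):

```cuda
#include <stdio.h>

int main(void) {
    // Figures quoted in the answers above (assumptions, not measurements).
    const double clock_ghz    = 1.38;   // Tesla V100 PCIe boost clock
    const double cuda_cores   = 5120;   // 1 fp32 FMA per core per clock
    const double tensor_cores = 640;    // 64 fp16 FMAs per core per clock
    // One FMA counts as 2 floating-point operations (multiply + add).
    double fp32_tflops   = cuda_cores   * 1.0  * 2.0 * clock_ghz / 1000.0;
    double tensor_tflops = tensor_cores * 64.0 * 2.0 * clock_ghz / 1000.0;
    printf("CUDA cores (fp32):          ~%.0f TFLOPS\n", fp32_tflops);    // ~14
    printf("Tensor cores (mixed prec.): ~%.0f TFLOPS\n", tensor_tflops);  // ~113, i.e. ~8x
    return 0;
}
```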

So, it's all a trade-off.

Kartik Podugu
  • 144
  • 1
  • 5
  • 1
    The number of FP16 operations per Tensor core is dependent on the Tensor Core version (currently version 1 to version 4). – Sebastian Dec 01 '22 at 15:01
0

Tensor cores use much less computation power than CUDA cores, at the expense of precision, but that loss of precision doesn't have much effect on the final output.

This is why, for machine learning models, tensor cores are more effective at reducing cost without changing the output that much.

Google itself uses Tensor Processing Units for Google Translate.

pranshu vinayak
  • 133
  • 1
  • 8
  • 17
    Misleading answer. Google's TPU and nvidia's Tensor Core have nothing in common. – bct Jul 07 '19 at 16:05