
I profiled training a ResNet-50 on an A100 node and found that only about 10% of the GPU kernel time is spent in kernels that use Tensor Cores.
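For what it's worth, here's roughly how I collected the profile (a minimal sketch; the standalone model and random inputs are stand-ins for my actual training script):

```python
import torch
import torchvision
from torch.profiler import profile, ProfilerActivity

model = torchvision.models.resnet50().cuda()
images = torch.randn(256, 3, 224, 224, device="cuda")  # batch size divisible by 8

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        with torch.cuda.amp.autocast():   # mixed-precision forward, as in my run
            loss = model(images).sum()
        loss.backward()

# Sort kernels by total GPU time; Tensor Core kernels often have names
# containing "884", "1688", or "hmma" in their mangled signatures.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```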

I followed the basic Tensor Core utilization tips (page 45 of these NVIDIA docs): I use batch sizes divisible by 8 along with AMP mixed-precision training in PyTorch.
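For reference, this is essentially my AMP setup (a minimal sketch; the optimizer, learning rate, and loss are stand-ins for my actual configuration):

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

images = torch.randn(256, 3, 224, 224, device="cuda")   # batch size divisible by 8
targets = torch.randint(0, 1000, (256,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.cuda.amp.autocast():                           # fp16 where it is safe
    loss = torch.nn.functional.cross_entropy(model(images), targets)
scaler.scale(loss).backward()                             # loss-scaled fp16 backward
scaler.step(optimizer)
scaler.update()
```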

In my profile, the following batch-norm kernel is the longest-running kernel that does not use Tensor Cores. Any idea how to make this kernel take advantage of the Tensor Cores?

    void cudnn::bn_bw_1C11_kernel_new<__half, float, float2, 512, true, 1>(
        float, float, float, float,
        cudnnTensorStruct, __half const*,
        cudnnTensorStruct, __half const*,
        cudnnTensorStruct, __half*,
        float const*, float*, float*,
        float const*, float const*, float)
