How can I use TensorFlow to do convolution using fp16 on GPU? (the Python API using __half or Eigen::half)

I want to test a model with fp16 in TensorFlow, but I got stuck. I found that fp16 convolution in TensorFlow appears to cast the result of an fp32 convolution to fp16, which is not what I need.

I tried two things: passing fp16 input directly to tf.nn.conv2d, and passing the same fp16 values cast to fp32 (with tf.cast) and then casting the result back to fp16. Both gave exactly the same result. But I would expect a convolution done in fp16 to differ from one done in fp32 and then cast to fp16. Am I wrong? Please help, thanks.

environment:
ubuntu 16.04
tensorflow 1.9.0
cuda 9.0
Tesla V100

import tensorflow as tf
import numpy as np
import os

def conv16_32(input, kernel): # fake fp16 convolution
    input = tf.cast(input, tf.float16)
    kernel = tf.cast(kernel, tf.float16)
    input = tf.cast(input, tf.float32)
    kernel = tf.cast(kernel, tf.float32)
    out = tf.nn.conv2d(input, kernel, [1,1,1,1], padding='VALID')
    out = tf.cast(out, tf.float16)
    out = tf.cast(out, tf.float64)
    return out

def conv16(input, kernel): # real fp16 convolution
    input = tf.cast(input, tf.float16)
    kernel = tf.cast(kernel, tf.float16)
    out = tf.nn.conv2d(input, kernel, [1,1,1,1], padding='VALID')
    out = tf.cast(out, tf.float64)
    return out

x = np.random.rand(16, 32, 32, 16).astype('float64')
w = np.random.rand(3, 3, 16, 16).astype('float64')
x = tf.get_variable('input', dtype=tf.float64, initializer=x)
w = tf.get_variable('weight', dtype=tf.float64, initializer=w)

out_16 = conv16(x, w)
out_16_32 = conv16_32(x, w)

os.environ['CUDA_VISIBLE_DEVICES'] = '1'
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config = config)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
print(sess.run(tf.reduce_max(out_16_32 - out_16)))

The two functions above give the same result; that is, the final print outputs zero.

In my view, the results of an fp16 convolution and an fp32 convolution should not be the same. How can I use TensorFlow to do a convolution in real fp16 on GPU? (the Python API using __half or Eigen::half)
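
To illustrate why the two could differ in principle, here is a minimal pure-Python sketch (not TensorFlow code). It assumes the running sum is rounded to the working precision after every term, which real GPU kernels may not do; the helper names `fp16`, `fp32` and `dot` are made up for this illustration.

```python
import struct

def fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, rnd):
    """Dot product whose running sum is rounded by `rnd` after every term."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = rnd(acc + x * y)
    return acc

a = [1.0] * 4096
b = [1.0] * 4096

pure_fp16 = dot(a, b, fp16)             # accumulator kept in fp16
fp32_then_cast = fp16(dot(a, b, fp32))  # accumulate in fp32, cast once at the end

print(pure_fp16, fp32_then_cast)  # 2048.0 4096.0
```

In this model the fp16 accumulator stalls at 2048 because 2049 is not representable in half precision, while the fp32 accumulator reaches 4096 exactly.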

talonmies
Di Huang
  • What's more, when I set tensorflow to work on CPU instead of GPU, the results are still the same, so I think that I am using tensorflow fp16 in a wrong way. – Di Huang Aug 21 '19 at 12:03
  • Thanks for the clarifications, now the post is complete and you have a legit concern/question. – Ander Biguri Aug 21 '19 at 12:48
  • May be an unexpected result of small convolutions. Tested again using pytorch. – Di Huang Aug 22 '19 at 09:59
  • Just as a comment, `CUDA_VISIBLE_DEVICES` will only have any effect on TensorFlow if you change it _before_ importing it. – jdehesa Aug 22 '19 at 09:59

2 Answers

I think you are using the operations correctly. In your example, you can check that the convolution operations do indeed have the right type.

conv2d_op_16 = out_16.op.inputs[0].op
print(conv2d_op_16.name, conv2d_op_16.type, conv2d_op_16.get_attr('T'))
# Conv2D Conv2D <dtype: 'float16'>
conv2d_op_16_32 = out_16_32.op.inputs[0].op.inputs[0].op
print(conv2d_op_16_32.name, conv2d_op_16_32.type, conv2d_op_16_32.get_attr('T'))
# Conv2D_1 Conv2D <dtype: 'float32'>

TensorFlow does register fp16 kernels for both CPU and GPU, so there is no reason to think it is doing anything else. I don't have a lot of experience with fp16, so I'm not sure whether the zero difference is "normal", but there does not seem to be any way in which conv16 is using anything other than an fp16 convolution.

jdehesa
  • The same on PyTorch. It may be something about FMA in the CUDA fp16 implementation. Anyway, I think that CUDA optimizes the fp16 calculation and keeps its intermediate results at almost full precision. Thanks for answering. – Di Huang Aug 31 '19 at 04:14
  • If you construct a simple convolution example that should overflow in f16 and execute it as f16 you find that it doesn't overflow, meaning that the underlying arithmetic is indeed performed in f32 and only the result is converted to f16. – Szabolcs Oct 29 '19 at 14:33
  • @Szabolcs I see what you mean, thanks for the example. I may be wrong, but I think, however, that from the point of view of TensorFlow everything is fp16. I don't know if there is a lower level of the stack where Eigen or CUDA are switching to fp32 at some point in the calculation for some reason... For example, some GPUs do not support fp16, or are slower at it (see [here](https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/)). – jdehesa Oct 30 '19 at 11:35
  • @jdehesa I believe that's the case: somebody working on TensorFlow decided to silently convert to f32 for performance or compatibility reasons, even when both convolution inputs are f16. I don't think this was a good idea. I'd like to test the accuracy of f16 convolutions and now I can't do that. – Szabolcs Oct 31 '19 at 13:53

I'm trying to figure out the same. Here is some simple code that you can test convolutions with:

import tensorflow as tf
tf.enable_eager_execution()
input = tf.cast([[[[65519], [65519], [65519], [65519]]]], tf.float16) #BHWC
filter = tf.cast([[[[65519]], [[-65519]]]], tf.float16) #HWIO
tf.print(tf.nn.conv2d(input, filter, [1,1,1,1], "VALID"))

This should overflow if the convolution is done in fp16, but it doesn't actually overflow in TensorFlow. The result I get is [[[[0][0][0]]]], which suggests that the convolution is performed in fp32.
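
The expected difference can be modeled without a GPU. The following is a pure-Python sketch, not what TensorFlow or cuDNN actually runs: it assumes the inputs and the output are fp16 and that the accumulation is a fused multiply-add rounded to either fp16 or fp32. `round_to` and `conv1d_valid` are hypothetical helper names.

```python
import struct

def round_to(fmt, x):
    """Round a float to the IEEE format given by a struct code
    ("<e" = fp16, "<f" = fp32), mapping overflow to +/-inf."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def conv1d_valid(signal, taps, acc_fmt):
    """1-D 'VALID' cross-correlation with fp16 inputs and output.
    The running sum is rounded to acc_fmt after every term."""
    signal = [round_to("<e", v) for v in signal]
    taps = [round_to("<e", v) for v in taps]
    out = []
    for i in range(len(signal) - len(taps) + 1):
        acc = 0.0
        for j, t in enumerate(taps):
            # Assumed fused multiply-add: the product is formed in double
            # precision here, and only the updated sum is rounded.
            acc = round_to(acc_fmt, acc + signal[i + j] * t)
        out.append(round_to("<e", acc))
    return out

x = [65519.0] * 4
w = [65519.0, -65519.0]
print(conv1d_valid(x, w, "<e"))  # [inf, inf, inf]
print(conv1d_valid(x, w, "<f"))  # [0.0, 0.0, 0.0]
```

In this model, rounding each product to fp16 before adding (no fused step) would give inf + (-inf) = nan rather than inf, so the exact non-finite value depends on how the accumulation is done.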

Edit: The solution is to set the environment variable:

TF_FP16_CONV_USE_FP32_COMPUTE=0

This gives the result [[[[inf][inf][inf]]]], suggesting that this time the convolution is performed in fp16. It seems you need at least a 10x0 GPU for this.
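
One way to apply this from a Python script is to put the variable in the process environment before TensorFlow is imported. This is only a sketch: exactly when TensorFlow reads the variable is an assumption here, so setting it before the import is the conservative choice. The helper name `disable_fp32_compute_for_fp16_conv` is hypothetical.

```python
import os
import sys

FLAG = "TF_FP16_CONV_USE_FP32_COMPUTE"

def disable_fp32_compute_for_fp16_conv():
    """Set the flag to "0" for this process.
    Returns True if TensorFlow had not been imported yet."""
    os.environ[FLAG] = "0"
    return "tensorflow" not in sys.modules

set_early = disable_fp32_compute_for_fp16_conv()
if not set_early:
    # Assumption: a late setting might be ignored, so warn instead of failing.
    print("Warning: TensorFlow was already imported; the flag may be read too late.")

# import tensorflow as tf  # import TensorFlow only after the flag is set
```

The same effect can be had by exporting the variable in the shell that launches the script.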

Szabolcs