8

As many machine learning algorithms rely on matrix multiplication (or at least can be implemented using it), I plan to test my GPU by creating two matrices a and b, multiplying them, and recording the time the computation takes to complete.

Here is the code, which generates two matrices of dimensions 300000 x 20000 and multiplies them:

import tensorflow as tf
import numpy as np

init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)


#a = np.array([[1, 2, 3], [4, 5, 6]])
#b = np.array([1, 2, 3])

a = np.random.rand(300000,20000)
b = np.random.rand(300000,20000)

println("Init complete");

result = tf.multiply(a, b)  # element-wise product, not a matrix multiplication
v = sess.run(result) 

print(v)
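
To record the time, I plan to wrap the sess.run call above in a simple wall-clock timer, roughly like this:

import time

start = time.time()
v = sess.run(result)  # the multiplication actually executes here
elapsed = time.time() - start
print("multiplication took %.3f seconds" % elapsed)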

Is this a sufficient test to compare the performance of GPUs? What other factors should I consider?

blue-sky

1 Answer

16

Here's an example of a matmul benchmark which avoids common pitfalls (it does a warm-up run, disables graph optimizations so the redundant multiplies aren't eliminated, and runs product.op so the result isn't copied back to the host), and matches the official 11 TFLOPS figure on Titan X Pascal.

import os
import sys
os.environ["CUDA_VISIBLE_DEVICES"]="1"
import tensorflow as tf
import time

n = 8192
dtype = tf.float32
with tf.device("/gpu:0"):
    matrix1 = tf.Variable(tf.ones((n, n), dtype=dtype))
    matrix2 = tf.Variable(tf.ones((n, n), dtype=dtype))
    product = tf.matmul(matrix1, matrix2)


# avoid optimizing away redundant nodes
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
sess = tf.Session(config=config)

sess.run(tf.global_variables_initializer())
iters = 10

# pre-warming
sess.run(product.op)

start = time.time()
for i in range(iters):
  sess.run(product.op)
end = time.time()
ops = n**3 + (n-1)*n**2 # n^2*(n-1) additions, n^3 multiplications
elapsed = (end - start)
rate = iters*ops/elapsed/10**9
print('\n %d x %d matmul took: %.2f sec, %.2f G ops/sec' % (n, n,
                                                            elapsed/iters,
                                                            rate,))
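
As a rough sanity check: the op count above is 2*n^3 - n^2, about 1.1e12 ops for n = 8192, so at 11 TFLOPS one matmul should take roughly 0.1 sec. If you also want a CPU baseline to compare against, the same graph can be pinned to the CPU by changing the device string (a quick sketch reusing the session and constants above, not tuned for CPU performance):

with tf.device("/cpu:0"):
    matrix1_cpu = tf.Variable(tf.ones((n, n), dtype=dtype))
    matrix2_cpu = tf.Variable(tf.ones((n, n), dtype=dtype))
    product_cpu = tf.matmul(matrix1_cpu, matrix2_cpu)

sess.run(tf.global_variables_initializer())
sess.run(product_cpu.op)  # pre-warming
start = time.time()
for i in range(iters):
  sess.run(product_cpu.op)
rate_cpu = iters*ops/(time.time() - start)/10**9
print('CPU: %.2f G ops/sec' % rate_cpu)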
Yaroslav Bulatov
  • cool, I think you should post your code within your answer in addition to referencing the code off-site. – blue-sky Jan 24 '17 at 16:53
  • GPU wasn't discovered unless `os.environ["CUDA_VISIBLE_DEVICES"]="1"` was commented out. Works with Windows 10, tensorflow-gpu (1.4), cuda_8.0.61_win10 and cudnn-8.0-windows10-x64-v6.0. – BSalita Dec 18 '17 at 17:23
  • Error was `Cannot assign a device for operation 'Variable_1': Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.` – BSalita Dec 18 '17 at 17:29
  • `cuda_8.0.61_win10` was downloaded from https://developer.nvidia.com/cuda-toolkit-archive. `cudnn-8.0-windows10-x64-v6.0` was downloaded from https://developer.nvidia.com/rdp/cudnn-download. – BSalita Dec 18 '17 at 17:31
  • This test correctly shows GPU performance. My 1050 Ti got 2.3 TFLOPS, which is about right. – Yeasin Ar Rahman Aug 21 '18 at 18:31