I compiled TensorFlow 1.3 from source and was unpleasantly surprised by its performance. Following the community's comments, I managed to reduce numpy's lead over TensorFlow from 45% to 35% when computing on the CPU. Still, the difference is huge. The benchmark code is given below:

#! /usr/bin/env python3

import sys
import time
import numpy as np
import tensorflow as tf

print('Python', sys.version)
print('TensorFlow', tf.__version__)

gDType = np.float64
size = 8192

# Numpy calculation
rand_array = np.random.uniform(0, 1, (size, size)).astype(gDType)  # keep both sides in the same dtype
timer0 = time.time()  
res = np.dot(np.dot(rand_array, rand_array), rand_array)
print("numpy multiply: %f" % (time.time() - timer0))


# TensorFlow calculation
x = tf.Variable(tf.random_uniform(shape=(size, size), minval=0, maxval=1, dtype=gDType), dtype=gDType, name='x')
x3 = tf.matmul(tf.matmul(x, x), x)

# Avoid optimizing away redundant nodes
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
sess = tf.Session(config=config)
# sess = tf.Session()  
sess.run(tf.global_variables_initializer())

# Exclude delays caused by initialization of the graph
timer0 = time.time()
sess.run(x3.op)
print("tensorflow multiply 1 pass: %f" % (time.time() - timer0))


timer0 = time.time()
sess.run(x3.op)
print("tensorflow multiply 2 pass: %f" % (time.time() - timer0))

Here is the output of the script:

$ ./matmul_benchmark.py 
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]
TensorFlow 1.3.0
numpy multiply: 37.464786
tensorflow multiply 1 pass: 61.245776
tensorflow multiply 2 pass: 49.944690

While running, the script consumes about 4 GB of RAM, so you may want to reduce the size variable to 4096.
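
A back-of-the-envelope calculation (my own estimate, not a measurement) shows where that memory goes:

size, bytes_per_elem = 8192, 8               # np.float64 is 8 bytes
print(size * size * bytes_per_elem / 2**30)  # 0.5 GiB per matrix
# numpy holds rand_array plus the two matmul results (~1.5 GiB);
# TF additionally keeps its own copy of the variable and its
# intermediates, so a ~4 GiB total footprint is plausible.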

The comparison shows numpy ahead by roughly 35% (50 s vs. 37 s).

Please tell me, is there a mistake in this test?

P.S. My Sandy Bridge CPU flags:

$ lscpu | grep Flags
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable
nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 **sse4_1 sse4_2** popcnt aes
xsave **avx** hypervisor lahf_lm epb xsaveopt dtherm ida arat pln pts
Ilya2567
  • Jeez, I'd hope so. A dot-product produces one number, a matrix multiply produces N*M. – Hans Passant Sep 13 '17 at 14:41
  • @HansPassant, that is incorrect. See https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html. Excerpt: "For 2-D arrays it is equivalent to matrix multiplication, and for 1-D arrays to inner product of vectors (without complex conjugation). For N dimensions it is a sum product over the last axis of a and the second-to-last of b: dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])" (a quick check below confirms the 2-D case). – Ilya2567 Sep 13 '17 at 15:01
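
A minimal check of the 2-D case from the excerpt above (plain numpy, nothing from this thread is assumed; @ is the Python 3.5+ matrix-multiplication operator):

import numpy as np

a = np.random.rand(3, 4)
b = np.random.rand(4, 2)
# For 2-D inputs, np.dot is exactly matrix multiplication.
assert np.allclose(np.dot(a, b), a @ b)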

2 Answers

  1. The first session.run takes longer because it includes initialization calls.
  2. Are you using an optimized numpy (check np.__config__.get_info())? Is your TensorFlow compiled with all optimizations (bazel build -c opt --config=opt)? See the snippet after this list for a quick check.
  3. numpy and TensorFlow manage memory separately; the default behavior of session.run is to copy the result into the numpy runtime. You can keep all data in the TF runtime for lower overhead.
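
A quick way to check point 2 on the numpy side (np.__config__ is part of numpy itself; nothing here is specific to this thread):

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against; look for
# mkl_info or openblas_info sections. A numpy linked against the plain
# reference BLAS will be dramatically slower at large matmuls.
np.__config__.show()
print(np.__config__.get_info('blas_opt'))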

Here's a version that avoids the common pitfalls (it cuts the overhead of needlessly copying the result back into numpy): Testing GPU with tensorflow matrix multiplication
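
The core idea, as a minimal sketch reusing size, gDType, x, and sess from the question's script (my paraphrase of the approach, not the linked code verbatim): assign the product to a TF variable instead of fetching it, so session.run returns nothing to Python.

# Store the result inside the TF runtime instead of fetching it, so no
# bytes are copied from the TF runtime back into numpy.
y = tf.Variable(tf.zeros((size, size), dtype=gDType), name='y')
store = tf.assign(y, tf.matmul(tf.matmul(x, x), x))
sess.run(y.initializer)
sess.run(store.op)  # runs both matmuls; returns nothing to Python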

In the best case I get 11 T ops/sec on a GPU and 1.1 T ops/sec on a Xeon V3 (vs. something like 0.5 T ops/sec with conda numpy).

Yaroslav Bulatov
  • Thank you, I appreciated your article. Numpy's lead over tensorflow dropped from 45% to 35%, but the difference is still huge. I've included the fix in the main post. Is it possible to improve the situation further? – Ilya2567 Sep 14 '17 at 08:20
  • After changing gDType from np.float64 to np.float32, the script's execution time dropped to 19 s (roughly 2.6x). Does TensorFlow not work well with np.float64? – Ilya2567 Sep 14 '17 at 08:52
  • When I run matmul_bench.py on my machine (Xeon E5-2630 v3 @ 2.40GHz), I get 1 second for TF, and 3.2 seconds for numpy, so TF is 3x faster -- https://github.com/yaroslavvb/stuff/blob/master/matmul_bench.py – Yaroslav Bulatov Sep 14 '17 at 12:55
  • Could you change the type to np.float64 and repeat the comparison? – Ilya2567 Sep 14 '17 at 13:46
  • Then it's 5.7 seconds for numpy vs 2.5 seconds for TensorFlow. I'm guessing my numpy distribution isn't properly optimized. If you use MKL-linked TensorFlow vs. MKL-linked numpy, I expect same performance since they will call the same function for the matmul – Yaroslav Bulatov Sep 14 '17 at 13:54
  • Your script does not pin the computation to a particular device. Could your GPU be helping TF? – Ilya2567 Sep 15 '17 at 06:20
  • Nope, the GPU runs 30x faster than numpy; I use an env var to turn the GPU off. – Yaroslav Bulatov Sep 15 '17 at 13:45

Intel has added optimizations to TensorFlow for Xeon and Xeon Phi through the Math Kernel Library for Deep Neural Networks (MKL-DNN). When compiling TensorFlow 1.3+, consider adding the --config=mkl option to the build. It only supports Linux, though. I'm not sure how much speedup it will give you for the benchmark you are doing.

Some numpy distributions already include MKL support. For example, in Anaconda versions 2.5 and later, MKL is available by default.

Lan
  • Have a look at this excerpt from the [article](https://www.tensorflow.org/install/install_sources#ConfigureInstallation): "Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]. This question refers to a later phase in which you'll use bazel to build the pip package. We recommend accepting the default (-march=native), which will optimize the generated code for your local machine's CPU type." I believe the specified optimizations are enabled. Thanks for the help. – Ilya2567 Sep 14 '17 at 06:29
  • You were right, the --config=mkl option is defined separately. After recompiling, the performance of TF and NP was the same. – Ilya2567 Sep 21 '17 at 09:09