
I am a TensorFlow enthusiast and I am trying to export a model (developed in Python, then frozen and optimized with the TensorFlow tools) for inference-only use within a C++ project. What I have experienced is that, even after following all the prescriptions found in issues already opened by other users, the C++ executable I obtain after compiling the source is about 10x slower in the inference operation (I mean session->run) than the same operation in Python inference code.

I am aware of different issues opened on this topic. Following those, I built the C++ project using the following command:

bazel build -c opt  --copt=-mfma --copt=-mfpmath=both //tensorflow/project:project
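As an aside, flags like --copt=-mfma or --copt=-mavx2 only help if the host CPU actually supports those instruction sets. A quick, Linux-specific sanity check (a sketch with a hypothetical helper, not part of TensorFlow) parses /proc/cpuinfo:

```python
import os

# Sketch: list the SIMD-related CPU flags this (Linux) machine reports,
# to decide which bazel --copt options are safe to pass.
# supported_simd_flags is a hypothetical helper, not a TensorFlow API.

def supported_simd_flags(cpuinfo_text):
    """Return the subset of interesting SIMD flags found in /proc/cpuinfo text."""
    interesting = {"sse4_2", "avx", "avx2", "fma"}
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return interesting & set(line.split(":", 1)[1].split())
    return set()

if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(sorted(supported_simd_flags(f.read())))
```

If a flag is missing from the output, passing the corresponding --copt has no benefit (and may even produce a binary that crashes with illegal-instruction errors on that machine).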

I also tried using the same batch size for the inference tensor as was used for training, but I still see the same 10x slowdown in the session->run operation.

I am aware that, in principle, a C++ implementation should be faster than a Python one (simply because Python is a higher-level language than C++), so this effect seems counterintuitive to me. My question is whether I am doing something wrong or whether this is simply a characteristic of TensorFlow.

Another question: searching the web, I found claims that freezing a graph slows down the inference process (I might be wrong about that), but I could not figure out any alternative to the frozen graph for loading a graph within C++ code (in any case, freezing the graph or not has no effect on the Python performance). Perhaps somebody could also explain which other options are currently available.

Thank you very much in advance for all your kind suggestions, and thank you for the outstanding job with TensorFlow.

  • [Python isn't always as slow as you think](https://thepcspy.com/read/python-isnt-slower-than-c/) – Bailey Parker Jan 22 '18 at 14:50
  • 1
    It's technically not about languages being fast or slow, but about how much stuff they do. Python generally uses dynamic types which tends to lead to a lot of type lookups while in C++ you get static types without type lookups by default. The difference is not in how high-level the language is, but in how much they do. And in this case C++ is doing something that Python doesn't. Figure out what that is and how to get rid of it. – nwp Jan 22 '18 at 14:54
  • did you try `--copt=-msse4.2 --copt=-mavx2` ? – aram Jan 22 '18 at 14:59
  • Yes, I tried that as well; it has no effect. – karakorum Jan 22 '18 at 15:16
  • 10 times sounds like a lot... Do you have a GPU that the python version uses but not the C++ one ? – gdelab Jan 22 '18 at 15:29
  • I inserted os.environ["CUDA_VISIBLE_DEVICES"]="0" at the beginning of the Python code. For measuring the time I use time.time() in Python and std::chrono::steady_clock in C++. I get 0.02 s for Python and 0.3 s for C++ – karakorum Jan 22 '18 at 15:33
  • Profile. Profiling will help you identify where your time bottlenecks are. Please edit your post with the C++ code fragment that is occupying the most time. – Thomas Matthews Jan 22 '18 at 15:39
  • If your C++ code works correctly, you may want to post to [codereview.se] for a code inspection. Add to your post that you are looking for speed optimizations and identify the bottleneck area of code. – Thomas Matthews Jan 22 '18 at 15:41
  • It seems to be spent in Eigen::internal::gemm_pack_rhs and Eigen::internal::gebp_kernel. Those are third-party C++ implementations; I cannot modify them. – karakorum Jan 22 '18 at 16:10
  • Does it improve when passing `--copt=-O3`? – Paul Belanger Jan 22 '18 at 20:15
  • I have just tried it: no effect, no improvement. Thank you in any case for the suggestion – karakorum Jan 23 '18 at 07:53
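One thing worth checking about the 0.02 s vs 0.3 s numbers quoted in the comments above: the first session->run call typically pays one-time costs (graph initialization, memory allocation), so a fair wall-clock comparison should discard a few warm-up runs and report a steady-state statistic. A minimal, framework-independent sketch (run_inference below is a hypothetical stand-in for the real session->run call):

```python
import time
import statistics

def time_steady_state(run_inference, warmup=3, repeats=20):
    """Time a callable the way session->run should be timed:
    discard warm-up calls, then report the median of repeated runs."""
    for _ in range(warmup):
        run_inference()              # first calls pay one-time setup costs
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # prefer perf_counter over time.time()
        run_inference()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

if __name__ == "__main__":
    # Stand-in workload instead of a real TensorFlow session.
    print(f"median run time: {time_steady_state(lambda: sum(range(10000))):.6f} s")
```

The equivalent measurement on the C++ side would wrap session->run the same way with std::chrono::steady_clock, so that both sides exclude warm-up and compare medians rather than single runs.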

1 Answer


I figured out that the problem is related to the frozen graph. I realized that I was using the checkpoint-saved model in the Python case while I used the frozen one in the C++ code (my mistake, sorry). It seems, anyway, that freezing the graph dramatically slows down the inference process: after switching to the frozen model in Python as well, the Python inference code takes the same time as the C++ one.

  • By the way, what is the final status on speed? frozen graph vs checkpoints in C++ vs Python? – Sathyamoorthy R Feb 11 '19 at 06:54
  • 2
    I got that using checkpoints is faster, but then I also got informed of the optimize_for_inferecne script. This helped in terms of performance. – karakorum Feb 12 '19 at 07:17