Part of my application uses TensorFlow to load a model. The application code is compiled against TensorFlow 2.3 using devtoolset-7. When I run the application binary, it crashes while loading libtensorflow_cc.so with the following stack trace:

Illegal instruction.
0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*)


Program received signal SIGILL, Illegal instruction.
0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*) ()
   from /lib64/libtensorflow_cc.so.2
Missing separate debuginfos, use: debuginfo-install controller-1.0.0-20201014_19_13_07.x86_64
(gdb) bt
#0  0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*) ()
   from /lib64/libtensorflow_cc.so.2
#1  0x00007fffea72df4e in tensorflow::monitoring::Gauge<bool, 0>::Gauge(tensorflow::monitoring::MetricDef<(tensorflow::monitoring::MetricKind)0, bool, 0> const&) ()
   from /lib64/libtensorflow_cc.so.2
#2  0x00007fffea72e1f4 in tensorflow::monitoring::Gauge<bool, 0>* tensorflow::monitoring::Gauge<bool, 0>::New<char const (&) [39], char const (&) [38]>(char const (&) [39], char const (&) [38]) ()
   from /lib64/libtensorflow_cc.so.2
#3  0x00007fffea3d0f7d in _GLOBAL__sub_I_context.cc () from /lib64/libtensorflow_cc.so.2
#4  0x00007ffff7dea9b3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7ddc17a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#6  0x0000000000000002 in ?? ()

The flags from /proc/cpuinfo are

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear spec_ctrl intel_stibp arch_capabilities

Can anyone help me out in understanding the issue in this?

  • Hi abhay, can you run objdump on the binary: `objdump -M intel -S [binary] | grep -i mm`? This is to cross-check whether there are any AVX-512 instructions. – Vipin Varghese Oct 19 '20 at 10:40
  • Thanks for the response. I ran "objdump -M intel -S /usr/lib64/libtensorflow.so.2 | grep -i mm", but the output is too big for me to paste here. – abhay jogekar Oct 20 '20 at 09:59
  • Can you just check whether AVX-512 is used in `nsync::nsync_mu_init` with `objdump -M intel -S /usr/lib64/libtensorflow.so.2 | grep -i zmm`? Or can you dump the instructions for `nsync::nsync_mu_init` and cross-check against `p $pc` to identify the faulting instruction (or use `list`)? – Vipin Varghese Oct 20 '20 at 10:02
  • Are there any updates on whether zmm registers are used in nsync_mu_init? Were you able to run it in GDB and identify the instruction that caused the issue? – Vipin Varghese Oct 21 '20 at 14:04
  • The CPU flags do not list any `avx512*`, and `p $pc` and `list` have not been run in GDB to identify which instruction caused the `illegal instruction`. Can you share that information? – Vipin Varghese Oct 31 '20 at 02:15
  • @VipinVarghese Thanks for being active on this. The CPU platform of the machine was Intel Broadwell, and as you said it doesn't have avx512* flag. We swapped the machine with Intel Cascade Lake platform and the crash is gone now. – abhay jogekar Nov 04 '20 at 13:56
  • Broadwell does not support AVX-512, that is, the zmm registers. That is why I asked you to check the instruction and the init function. Good to hear the problem is resolved. – Vipin Varghese Nov 04 '20 at 14:50

1 Answer

TensorFlow makes heavy use of AVX instructions on x86 platforms. If the binary was compiled with AVX-512 (which uses the zmm registers), it can run only on hardware that supports AVX-512. That is why the comments asked you to check which instructions are involved, via

  1. `objdump -M intel -S /usr/lib64/libtensorflow.so.2 | grep -i zmm`, and
  2. `print $pc` in GDB (or `x/i $pc` to disassemble the instruction at that address) to isolate the faulting instruction.
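
The raw `grep` output for a library this size can be too large to read, as noted in the comments. As a rough sketch only, the check can be narrowed to a per-function count of lines that mention a zmm register. The helper name `count_zmm_by_symbol` is hypothetical, and the script assumes the usual `objdump -d` layout in which each function starts with an `address <symbol>:` header line:

```python
import re
import subprocess
import sys
from collections import Counter

# Assumed objdump layout: "0000000000001000 <symbol>:" opens each function.
SYMBOL_HEADER = re.compile(r"^[0-9a-fA-F]+ <(.+)>:\s*$")
# Matches zmm0..zmm31 in both Intel ("zmm0") and AT&T ("%zmm0") syntax.
ZMM_REGISTER = re.compile(r"\bzmm\d+\b", re.IGNORECASE)


def count_zmm_by_symbol(lines):
    """Return {symbol: number of disassembly lines that reference a zmm register}."""
    counts = Counter()
    current = None
    for line in lines:
        header = SYMBOL_HEADER.match(line.rstrip("\n"))
        if header:
            current = header.group(1)
        elif current is not None and ZMM_REGISTER.search(line):
            counts[current] += 1
    return dict(counts)


if __name__ == "__main__":
    # Usage: python zmm_scan.py /lib64/libtensorflow_cc.so.2
    library = sys.argv[1]
    # Stream the output line by line; the disassembly of a large library is huge.
    with subprocess.Popen(
        ["objdump", "-d", "-C", "-M", "intel", library],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        result = count_zmm_by_symbol(proc.stdout)
    for symbol, n in sorted(result.items()):
        print(f"{n:6d}  {symbol}")
```

If `nsync::nsync_mu_init` does not appear in the output, the faulting instruction is probably not an AVX-512 one, and the GDB check in step 2 is the more reliable way to identify it.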

Note: per the asker's update in the comments, moving from Broadwell (no AVX-512) to Cascade Lake (AVX-512) resolved the crash.
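
Before deploying to a new machine, the same comparison can be made against the `flags` line of `/proc/cpuinfo`. The following is a minimal sketch, not a definitive check: the function names are hypothetical, and the list of required flags is an assumption that should be replaced with whatever instruction sets the library was actually built for.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(flags, required):
    """Return the required flags that the CPU does not report, in the given order."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    # Assumed requirement list; adjust to match how the library was compiled.
    required = ["avx", "avx2", "fma", "avx512f"]
    with open("/proc/cpuinfo") as handle:
        flags = parse_cpu_flags(handle.read())
    missing = missing_flags(flags, required)
    if missing:
        print("CPU does not report:", " ".join(missing))
    else:
        print("CPU reports all required flags")
```

On the Broadwell flags listed in the question, `avx`, `avx2`, and `fma` are present and no `avx512*` flag is, which matches the observation in the comments.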

Vipin Varghese