We've been trying to track down some strange behavior in our large legacy app for the past few months. It suffers from random, occasional memory corruption, vtable corruption (??), and other strange behavior such as infinite loops in std::_Rb_tree, std::map, and even std::string s = "abcd".

The target machine is 32-bit CentOS 6, so we started with the built-in g++ 4.4. AddressSanitizer wasn't available there, so we moved to 4.8 in devtoolset-2, and have now compiled gcc 4.9 from source.

Valgrind (v3.8.1) doesn't work with any of them - it gives unhandled instruction bytes: 0xC5 0xF9 0x6E 0xC5

AddressSanitizer reports different errors between gcc 4.8 and 4.9, and often reports bogus global-buffer-overflow errors like

Error reading 0x080e3ad4 where 0x080e3ad4 is located 0 bytes to the right of global variable 'x'

Except that 0x080e3ad4 is the address of x!

Worse yet, the erroneous behavior is different between gcc versions.

I've read a lot of posts comparing the speed advantages of gcc v5, 6, 7, but nothing about stability of the code they produce.

Since it takes a lot of effort to build a new gcc and reconfigure our large app to build with it, and test & package the distribution for customers, here's the question:

  • What improvements in stability come with a newer gcc?
  • How has AddressSanitizer been improved?
  • Can we expect Valgrind to not barf out on illegal op codes?

All suggestions appreciated. It's been months on this problem.

EDIT

To be clear, I'd like to get instrumentation that works (e.g. AddressSanitizer, Valgrind or others) so we can fix the underlying behavior. Whether those debug tools work is tied to the compiler environment rather than to our C++ program. As one example, there is no assembler in the program, so Valgrind should understand the instructions from the compiler and/or the compiler should emit opcodes that Valgrind understands.

EDIT II

G++ compiler flags (for debug compile)

/usr/local/gcc-4.9.4/bin/c++4.9 -c -std=c++98  -m32 -g -ggdb -O0 
-Wall -Wextra -Wno-sign-compare -Wcast-align -fdiagnostics-color=auto  
-ftemplate-depth=32 -march=native -fPIC -o xx.o xx.cpp

Target hardware: i5-6500

Danny
  • This sounds like you should spend time finding and fixing UB instead of finding a compiler that happens to produce behavior you like. – nwp Aug 31 '17 at 11:37
  • AFAIK there is no improvement in stability with a newer gcc. If your code is broken, you can't blame the compiler. – 463035818_is_not_an_ai Aug 31 '17 at 11:38
  • Yup, I'd like to... but hard with `AddressSanitizer` and `Valgrind` not working, so don't know what other tools we can use to debug the problem. – Danny Aug 31 '17 at 11:38
  • So after you build with different gcc version than on target system, did you deploy it properly ensuring that everything is binary compatible? – user7860670 Aug 31 '17 at 11:39
  • I don't know if I'd call either of them unstable per se. It could have to do with linking against libraries compiled for other architectures. While you have asked answerable questions, I maintain a healthy suspicion that this is still an XY problem. I'm betting these issues disappeared when you tried to make a small reproducible example (and if you haven't, then do that). – AndyG Aug 31 '17 at 11:39
  • VTT, the app is self contained with all of its own libraries built against the same gcc. But... if the libraries weren't binary compatible, wouldn't they not load at all? The app runs properly for hours before (maybe) having trouble. It is also spread across multiple programs communicating over unix domain sockets making it harder. The main app has one executable which loads 45 shared libraries we wrote as well as the ACE/TAO CORBA ORB... – Danny Aug 31 '17 at 11:46
  • "Can we expect Valgrind to not barf out on illegal op codes?" The opcode itself is perfectly legal (`C5 F9 6E C5` is `vmovd xmm0, ebp`), but 32-bit valgrind doesn't support AVX and is pretty much deprecated by upstream. – Fanael Aug 31 '17 at 11:59
  • Infinite loops and other strange things can potentially be caused by heap corruption, so that is likely the root of the problem. I'm not sure of any Linux tools that help with this if you're 32-bit and using AVX, as Fanael said. If possible you may want to try 64-bit, although if the legacy code is 32-bit only that might be even more trouble. Almost certainly not a GCC (or glibc etc.) issue; it's more likely some of your code is writing through an invalid pointer or overflowing an array somewhere. – Fire Lancer Aug 31 '17 at 12:05
  • Fanael, do you mean don't bother trying `Valgrind` anymore because of the lack of AVX support on 32-bit? Porting our app to 64-bit is not an option. – Danny Aug 31 '17 at 12:08
  • Binary compatibility can be a very tricky problem. In my experience, shipping software compiled with a different compiler version than the one used on the target machine is never a good idea. As for fixing the problems mentioned in the post, I would suggest compiling everything with the built-in gcc 4.4 at the highest warning level first. You may want to set `-fno-strict-aliasing`, because UB from strict aliasing violations is quite common but hard to track. And you definitely need to find a way to check the involved modules separately. – user7860670 Aug 31 '17 at 12:11
  • Fire Lancer, for sure the bug is in our code – just a question of how to find it. Heap corruption, runaway pointer, etc. The application does a lot of protocol packet parsing, so unfortunately it has assumptions about the size of int, short, long, etc. I'd like to port it to 64-bit but that'd take months. As for AVX, we're not "using" AVX per se. g++ just emits opcodes valgrind doesn't understand... Will keep trying with `AddressSanitizer` – Danny Aug 31 '17 at 12:14
  • It is definitely worth checking the code with valgrind. If you compile all your code from source without requesting AVX instructions and there is no hand-written assembly, then there should be no AVX opcode in the binary confusing valgrind. Most likely it is just some form of corruption occurring. – user7860670 Aug 31 '17 at 12:15
  • VTT, thanks. I'll recompile with `-fno-strict-aliasing` and `-mno-avx` and see if Valgrind works. For your reference, I've added the original compiler command line with option flags to the question. – Danny Aug 31 '17 at 12:35
  • VTT, recompiled with `-mno-avx` and now Valgrind works!! It is already reporting memory issues. Thanks!! – Danny Aug 31 '17 at 15:43
  • VTT, perhaps you have some insight for this question: https://stackoverflow.com/questions/45992636/gcc-c-disable-generation-of-vex-instructions – Danny Sep 01 '17 at 02:43
