G-WAN with valgrind? Alternatives?

Question

G-WAN is a convenient way to run C code on the web out-of-the-box, but for me it does not work with valgrind. (Running valgrind ./gwan prints the error message Inconsistency detected by ld.so: rtld.c: 1292: dl_main: Assertion `_rtld_local._dl_rtld_map.l_libname' failed! and then exits; the system is 64-bit Debian Jessie.)

The question is:
1) Is G-WAN supposed to work with valgrind?
2) Are there any other viable options to detect memory bugs in C code running under G-WAN?

Gil · Answer 1 · 2013-07-20T13:01:01.493

Is G-WAN supposed to work with valgrind?

We have tested Valgrind and while it does many things right, it is just not suitable for high-concurrency jobs (even low-concurrency is a problem with Valgrind).

viable options to detect memory bugs in C code running under G-WAN?

Use malloc() wrappers, pre-allocated pools, or even better, use alloca() to avoid memory issues in the first place.
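To make the malloc() wrapper idea more concrete, here is a minimal sketch that records every live block in a small table, so that a double free or a stray pointer is reported instead of being passed on to free(). The names dbg_malloc(), dbg_free() and dbg_block_size(), as well as the fixed table size, are hypothetical. This is not G-WAN's allocator, and it is not thread-safe.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define DBG_MAX_BLOCKS 1024   /* hypothetical limit, sized for a demo */

/* One slot per live allocation: the pointer and its requested size. */
static struct { void *ptr; size_t size; } dbg_blocks[DBG_MAX_BLOCKS];

/* Return the size of a live block, or 0 if ptr is not tracked. */
size_t dbg_block_size(const void *ptr)
{
    if (!ptr)
        return 0;
    for (size_t i = 0; i < DBG_MAX_BLOCKS; i++)
        if (dbg_blocks[i].ptr == ptr)
            return dbg_blocks[i].size;
    return 0;
}

/* malloc() wrapper: allocate, then remember the block in a free slot. */
void *dbg_malloc(size_t size)
{
    if (size == 0)
        return NULL;   /* keeps "size 0" meaning "not tracked" */
    for (size_t i = 0; i < DBG_MAX_BLOCKS; i++) {
        if (dbg_blocks[i].ptr == NULL) {
            void *p = malloc(size);
            if (!p)
                return NULL;
            dbg_blocks[i].ptr = p;
            dbg_blocks[i].size = size;
            return p;
        }
    }
    return NULL;       /* table full: refuse rather than return an untracked block */
}

/* free() wrapper: 0 on success, -1 for an unknown or already freed pointer. */
int dbg_free(void *ptr)
{
    if (!ptr)
        return 0;      /* like free(NULL), this is a no-op */
    for (size_t i = 0; i < DBG_MAX_BLOCKS; i++) {
        if (dbg_blocks[i].ptr == ptr) {
            dbg_blocks[i].ptr = NULL;
            dbg_blocks[i].size = 0;
            free(ptr);
            return 0;
        }
    }
    fprintf(stderr, "dbg_free: %p is not a live block (double free?)\n", ptr);
    return -1;
}
```

A wrapper like this only sees pointers that went through dbg_malloc(). Blocks allocated inside libraries by hidden malloc() calls stay invisible to it, and the linear scan is far too slow for anything but debugging.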

Note that G-WAN handles bad pointers in C scripts without crashing the server, see: http://gwan.ch/developers#crash

This buggy code:

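#include <string.h>

int main(int argc, char *argv[])
{
   // deliberately invalid: both arguments point to unmapped memory
   strcpy((char *)0xBADC0DE, (char *)0xBADC0DE);
   return 200; // HTTP status code
}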

...will produce something like the following 'graceful' crash report:

Script: crash_libc.c
 Client: 127.0.0.1
 Query : ?crash_libc

 Signal        : 11:Address not mapped to object
 Signal src    : 1:SEGV_MAPERR
 errno         : 0
 Thread        : 0
 Code   Pointer: 0000f5200b33 (module:/lib/libc.so.6, function:strcpy, line:0)
 Access Address: 00000badc0de

 Registers     : EAX=00000badc0de CS=00000033 EIP=0000f5200b33 EFLGS=000000010202
                 EBX=000000000001 SS=ec2d8ed4 ESP=0000f5ded828 EBP=0000f5dee020
                 ECX=000033323130 DS=ec2d8ed4 ESI=0000ec2d8f86 FS=00000033
                 EDX=000003b03c00 ES=ec2d8ed4 EDI=00000badc0de CS=00000033

 Module        :Function        :Line # PgrmCntr(EIP)  RetAddress  FramePtr(EBP)
      libc.so.6:          strcpy:     - 0000f5200b33 0000ec2d8f00   0000f5dee020
        servlet:            main:    37 0000ec2d8f00 00000042e10c   0000f5dee020

And G-WAN goes as far as to tell you where the bug happened in your source code (see the G-WAN crash_xxx.c examples) instead of killing the server process.

If you don't want to debug C code, then use Java or Scala (both supported by G-WAN). You will need much more memory because your data will remain loaded until the GC slows everything down to free what it thinks can be freed, but at least you will enjoy fewer memory-related bugs, if any.

Per the request of the person asking the question, here are more details.

In late 2012, we tested a dozen free and commercial tools which, like Valgrind, are supposed to help debug concurrent code. We also used static tools that study source code, not only dynamic tools that work on running (compiled) programs.

The sad truth is that they all suffer from common problems. They:

- are generally too slow to support concurrency (the core issue)
- produce gazillions of trivial alerts (and even more false alerts)
- are very expensive (that's for the commercial ones, of course) and cannot always be tested before buying(!)

So, after weeks of checking and filtering all those results, we spent a lot of time "correcting" the G-WAN codebase to remove the trivial and false alerts (alerts caused by tools that can't distinguish valid code from buggy code)... but, to our dismay at the time, we did not find any real bug in G-WAN (making it clear that those weeks were wasted time).

Hence the conclusion above: try to write simple code when possible, and try to pre-allocate blocks when more sophisticated strategies are needed.
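As one possible reading of "pre-allocate blocks", here is a minimal sketch of a fixed-size block pool carved out of static storage. All names (pool_init(), pool_alloc(), pool_release(), pool_owns()) and both size constants are hypothetical, and the code assumes a C11 compiler and a single thread. It does not detect a block that is released twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  64    /* hypothetical: usable bytes per block */
#define POOL_BLOCK_COUNT 128   /* hypothetical: blocks reserved up front */

/* A block is either user data or, while free, a link in the free list. */
typedef union pool_block {
    union pool_block *next;
    unsigned char data[POOL_BLOCK_SIZE];
    max_align_t align;         /* keeps every block suitably aligned */
} pool_block;

static pool_block pool_storage[POOL_BLOCK_COUNT];
static pool_block *pool_free_list;

/* Thread every block onto the free list. Call once before first use. */
void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Take one block, or return NULL when the pool is exhausted. */
void *pool_alloc(void)
{
    pool_block *b = pool_free_list;
    if (b)
        pool_free_list = b->next;
    return b;
}

/* Return 1 if ptr is the start of a block inside the pool, else 0. */
int pool_owns(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)pool_storage;

    if (p < base || p >= base + sizeof pool_storage)
        return 0;
    return (p - base) % sizeof(pool_block) == 0;
}

/* Give a block back. A pointer from outside the pool trips the assert. */
void pool_release(void *ptr)
{
    assert(pool_owns(ptr));
    pool_block *b = ptr;
    b->next = pool_free_list;
    pool_free_list = b;
}
```

Because the storage is reserved up front, allocation cannot fail halfway through a request for lack of system memory, and a pointer can be checked against the pool bounds before it is used or released.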

Of course, the fact that the Linux LIBC insists on killing applications with (non-catchable) abort signals does not help (this prevents the program from recovering or from dumping a relevant trace), especially for the sloppy double-free detection in the Linux LIBC (which wrongly assumes that all the code is using its malloc() as soon as a program has used malloc() once, which is often done by LIBC calls!). And I am not even talking about mmap() failures or about the OOM kill-switch.

The only solution that we have found working so far is to avoid using the Linux LIBC, and to compile everything we need with our own C runtime. This is a bit difficult to recommend as "the thing to do" for all users, but it worked for us.

We would be very happy to see portions of our code (or at least some of the concepts implemented in G-WAN) used by Linux, as this would make our life (and that of many other developers) immensely easier, but the contacts that we have had in the past with "the people in charge" were not encouraging.

All in all, there's room for improvement, from the OS, from ISVs like us, and from developers. After all, concurrency has "only" been mainstream since 2004... almost ten years ago.

edited Jul 20 '13 at 13:01

answered Jul 16 '13 at 12:34

Gil

Valgrind runs on a single core, but it debugs high-concurrency jobs without problems. When you have a bug, you can always redirect *a part* of the traffic to a valgrind-ed server. LD_PRELOAD malloc wrappers like DUMA and ElectricFence do not work with G-WAN either (DUMA crashes right away, EF in `stream3.c`). Memory handling isn't as simple as alloca when one is dealing with asynchronous code, and even with alloca every large program will have bugs and memory corruption if it is at all possible. Thanks for your answer. The question stands, however: what is a working valgrind alternative with G-WAN? – ArtemGr Jul 16 '13 at 13:50
I will add more details in the reply above, but in short (really capable) concurrency debug tools are not commercially available at the moment. – Gil Jul 17 '13 at 15:08
Thanks again for the answer, Gil. Your frustration with the scene isn't really a solution for the problem of debugging the code, though. What about "-fsanitize=address" in new compilers, do you plan to support that in G-WAN? – ArtemGr Jul 17 '13 at 17:38
Our *"frustration"* stopped when we stopped relying on Linux LIBC. And "-fsanitize" kills the host process after reporting the first error (just like LIBC) - while demonstrating by its sole existence that there are pending issues in the kingdom. Besides, the G-WAN memory allocator is faster, safer, and does not die with double-free calls or other NOPs, and it can report whether a pointer is still attached to an allocated block or not (by reporting the block size)... without killing the process. – Gil Jul 18 '13 at 12:25
Having a dangling pointer is undefined behaviour; you don't want such a process running, as it might lead to unrecoverable data corruption. Valgrind and "-fsanitize" allow one to detect the error where it happens or at least closer to that site. It is a *feature* of these tools, not a problem. They are used to test the code (cf. http://cplusplusmusings.wordpress.com/2013/03/20/testing-libc-with-address-sanitizer/), not to make faulty code run 24x7. Returning to G-WAN, how would user code check "whether a pointer is still attached to an allocated block"? What function is it? Thanks. – ArtemGr Jul 18 '13 at 13:30
ArtemGr wrote: *"Having a dangling pointer is undefined behaviour, you don't want such process running"*. Unless you have a way to check pointers before calling free() or before dereferencing them - and that's what G-WAN's allocator allows us to do. We did not export these capabilities for scripts because people rely on libraries which make hidden malloc() calls (like LIBC) and the only way to cope with those pointers is to pass them on to the LIBC malloc()/free()... with the unmanageable problems that you have depicted. – Gil Jul 19 '13 at 14:19
_"Unless you have a way to check pointers before calling free() or before dereferencing them"_ - That doesn't make dangling pointers safe. – ArtemGr Jul 19 '13 at 14:42
It absolutely does - when your API calls embed this logic transparently for callers. Note that this was the full point of bothering to rewrite the system memory allocator. – Gil Jul 20 '13 at 12:11
No it does not, double-free is not the only problem with dangling pointers. – ArtemGr Jul 20 '13 at 12:12
Re-read what I wrote, and you will see that our memory-allocator is addressing all the other deadly cases - without ever killing the process. – Gil Jul 20 '13 at 12:18
You only wrote that your memory allocator checks if the pointer is already freed to prevent double-free errors. I see no mention of your *allocator* checking all pointer access to prevent a dangling pointer from corrupting data and to detect it early, the way valgrind or -fsanitize does check it. – ArtemGr Jul 20 '13 at 12:22
I wrote: *"it can report whether a pointer is still attached to an allocated block or not (by reporting the block size)"* and this allows our code to check if a pointer is valid or not. – Gil Jul 20 '13 at 12:25
I got it, but it doesn't help me, a software developer, in any way. If there is a dangling pointer in my code or in library code I use, G-WAN will not detect it. – ArtemGr Jul 20 '13 at 12:33
**G-WAN 'graceful' crash reports detect and report bad pointers** (see the updated answer above)... unless LIBC decides otherwise and kills the G-WAN process with a non-recoverable ABORT signal. Hence my remark above that if Linux accepted some of G-WAN code (or concepts) then this *"would make our life (and the one of many other developers) immensely easier"*. In the meantime, there's no substitute to debugging (or to using the JVM or the CLR) - but that's not G-WAN's fault. – Gil Jul 20 '13 at 13:04
Nice to know. What I see in G-WAN's 'graceful' crash reports isn't very different from a custom SIGSEGV handler doing stack unwinding or attaching GDB to the killed process. I use such handlers, with google-coredumper, for example, and they help. It's not a replacement for valgrind or fsanitize, however, because the faulty code can still silently corrupt the data if it happens to write to a memory region that is mapped to a writeable memory page. You need guard regions to catch that. cf. https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerAlgorithm – ArtemGr Jul 20 '13 at 15:21
_"If you don't want to debug C code, then use Java or Scala"_ - I *do* want to debug my code, and I want to do this with the best tools available. – ArtemGr Jul 20 '13 at 15:23
The *"best tools available"* are your brains. Don't expect that a library or tracer will magically resolve everything - those tools have a long way to go to become more relevant. As said previously, there are ways to *avoid* the conditions that lead to deadly memory errors, but as long as the Linux LIBC (the system runtime) stays away from those concepts, you will have to *"smash your head on the keyboard to continue"*... and resolve those pointless issues. What makes new ideas necessary today is the added complexity injected by concurrency issues. Time is on the side of tools that help. – Gil Jul 22 '13 at 06:51
"_The "best tools available" are your brains_" - and it is a statistically known fact that even the best brains produce bugs. "_Don't expect that a library or tracer will magically resolve everything_" - It's not magic, it's a proven technology. Valgrind has helped me many times, it is very useful to pinpoint the location of a bug and to test new code. "_As said previously, there are ways to avoid the conditions that lead to deadly memory errors_" - As said previously, automatic memory management (RAII and shared_ptr) is not that simple with event-driven code. – ArtemGr Jul 23 '13 at 15:16
I don't see the technical value of this latest "philosophical" comment, nor do I see any question in it, hence my lack of any meaningful answer. Your initial question was about suitable alternatives to Valgrind (which does not suit concurrency tasks) and I have answered to the best of my knowledge. Note that nobody felt a better answer could be given, so that's not only me having those mixed feelings. – Gil Jul 24 '13 at 04:28
