
I've been working on a WebRTC data channel library in C/C++ and wrote a test program in C to:

  1. Create two peers from the same process.
  2. Establish a connection between them.
  3. Close the connection if it's successful.

Everything runs fine in a Debian Docker container and on my host, openSUSE Tumbleweed (both x86_64, 64-bit), but in an Alpine Linux container (also x86_64, 64-bit) I get a segfault inside the child processes:

[Screenshot: segfault in libnice]

The function above is from the program's dependency, libnice. It looks as if *agent == NULL, yet nothing in the caller's scope could have set it to null. I even inserted a printf("Argument is %p", agent); right before the function call; it prints the pointer's value, so I can verify that it is not null. From the disassembly, the segfault appears to happen on the instruction that copies the agent argument (0x557a1d20) into a local variable on the callee's stack. The segfault always occurs at this point, even after a make clean and recompilation. Is setting up the activation record failing? Is this stack corruption?

UPDATE: I built a more lightweight container and ran the program in it, and now it segfaults at a different place in the same function, priv_conn_keepalive_tick_unlocked. The argument does seem to be set, though (notice the 0x7ffff7f9ad08): [Screenshot: segfault2]

Since I thought I might be hitting musl's default stack limit of 80k, I used getrlimit(RLIMIT_STACK, &rl) to obtain the stack size, and it is already 8 MB, not 80k. Increasing this limit further makes no difference, except that if I assign more than 8 MB, my program crashes early when run inside GDB, which reports an unknown signal "? ?". Outside GDB, it crashes at the same point as it does without the altered stack size.
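For reference, the check described above can be sketched roughly as follows. The helper names stack_soft_limit and print_stack_limit are made up for this illustration and are not taken from the actual program. Note that RLIMIT_STACK describes the main thread's stack; whether it also influences threads started with pthread_create depends on the libc.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: stores the soft RLIMIT_STACK value in *soft.
 * Returns 0 on success, -1 if getrlimit() fails. */
int stack_soft_limit(rlim_t *soft)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: prints the soft limit in a readable form. */
void print_stack_limit(void)
{
    rlim_t soft;

    if (stack_soft_limit(&soft) != 0) {
        perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY)
        printf("stack soft limit: unlimited\n");
    else
        printf("stack soft limit: %llu bytes\n", (unsigned long long)soft);
}
```

Calling print_stack_limit() early in main() and again inside each forked child shows whether the children inherit the same limit.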

I'm not sure what exactly the problem is (stack corruption?) and what to do next to resolve this.

Here's my program's flow:

For every peer that is created, a child process is created with fork(). Parent <--> child communication goes over ZeroMQ, and I use protocol buffers to forward any callbacks (and their arguments) that are triggered inside the child onto an event loop running in the parent process.

So for the above program, there are 2 child processes and 1 parent process.
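A stripped-down sketch of this flow for a single peer is below. It is only an illustration: a plain pipe stands in for ZeroMQ, a fixed struct stands in for the protocol buffers message, and the names peer_event, EVENT_CONNECTED and run_peer_child are hypothetical and do not appear in the real code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical event record; the real program serializes callbacks
 * with protocol buffers instead of sending a raw struct. */
struct peer_event {
    int peer_id;
    int code;
};

enum { EVENT_CONNECTED = 1 };

/* Fork one child for a peer, read a single event from it, reap the child.
 * Returns 0 on success (and fills *out), -1 on any failure. */
int run_peer_child(int peer_id, struct peer_event *out)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: pretend a callback fired and forward it to the parent. */
        struct peer_event ev = { peer_id, EVENT_CONNECTED };
        close(fds[0]);
        ssize_t n = write(fds[1], &ev, sizeof ev);
        close(fds[1]);
        _exit(n == (ssize_t)sizeof ev ? 0 : 1);
    }

    /* Parent: the real program would dispatch this on its event loop. */
    close(fds[1]);
    ssize_t n = read(fds[0], out, sizeof *out);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n != (ssize_t)sizeof *out)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling run_peer_child() once per peer gives the two children and one parent described above. If a child dies from a signal such as SIGSEGV, the parent sees it through the waitpid() status rather than through a forwarded event.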

Steps to reproduce:

Irfan
  • Have you considered that the root of the problem may exist on all platforms and merely happens to show up on this one? The solution would be to debug heavily and find the root cause, which is probably made harder by it being well hidden on most platforms. – Yunnosch Feb 02 '18 at 06:40
  • Assuming you can reproduce the crash at will, the first thing I would do is insert a printf("calling priv_conn_keepalive_tick_unlocked with argument %p\n", agentPtr); line immediately before the call to that function -- that way you can verify that the call is (or is not) being made with a NULL argument. – Jeremy Friesner Feb 02 '18 at 06:40
  • Also, the fact that the function name ends in the suffix _unlocked makes me suspect that the function's author was trying to communicate that the function doesn't do any of its own mutex-locking, but rather depends on the calling code to lock the appropriate mutexes *before* calling the function. A failure to lock the necessary mutexes first could well result in a crash that only occurs under certain circumstances (e.g. only under one particular OS), which would match your observations. – Jeremy Friesner Feb 02 '18 at 06:42
  • @JeremyFriesner Yes I did add the print statement with the `agent` argument just before the function call and the pointer is not null. – Irfan Feb 02 '18 at 06:57
  • Which other flavours of Linux have you tried this on? Also, I don't think your analysis of what is going on is correct. I suspect you have stack corruption, and your local variables have become "rubbish". – Mats Petersson Feb 02 '18 at 06:58
  • What exactly is the difference between your debian that works and alpine - are they the same bitness (64/32)? Are you executing the same binary compiled with the same compiler, or different binaries compiled with different compilers? – Mats Petersson Feb 02 '18 at 07:00
  • @MatsPetersson The difference I can think of is libmusl instead of glibc on alpine vs debian. The program works on OpenSUSE tumbleweed in addition to debian. All of these are x86_64 64bit. I compile and run the binaries on each platform with the compilers and dependencies available from the package repos. – Irfan Feb 02 '18 at 09:36
  • Can you reproduce this with a minimal, complete, and verifiable example? There are too many independent variables here. Also, have you run it under Valgrind? That's always helpful for finding uses of uninitialised or freed memory that you might otherwise get away with. – Toby Speight Feb 02 '18 at 11:28

2 Answers


On further investigation, the crash is in an instruction writing at a mildly large negative offset from the stack base pointer, so it's probably just a simple stack overflow.

The right way to fix this is reducing the excess stack usage or explicitly requesting a large stack at pthread_create time, but I don't see where pthread_create is being called from. A quick check to verify that this is the problem would be to override the default stack size for new threads by performing the following somewhere early in the program:

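/* Needs #define _GNU_SOURCE before #include <pthread.h> (the _np call is a non-portable extension). */
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1<<20); // 1 MB
pthread_setattr_default_np(&attr);       // applies to threads created from now on
pthread_attr_destroy(&attr);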
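For the other option mentioned above, requesting a larger stack explicitly at pthread_create time, a minimal sketch could look like this. It uses only standard POSIX calls; the names worker, run_with_stack and BIG_BUF are hypothetical, and the 256 KiB buffer is just an arbitrary stand-in for whatever is eating stack in the real thread.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define BIG_BUF (256 * 1024)   /* arbitrary size, larger than an 80k stack */

/* Hypothetical worker: touches a large stack buffer and reports success
 * through the int that arg points to. An optimizing compiler may shrink
 * or remove the buffer, so treat this as an illustration only. */
static void *worker(void *arg)
{
    char buf[BIG_BUF];

    memset(buf, 0x5a, sizeof buf);
    *(int *)arg = (buf[sizeof buf - 1] == 0x5a);
    return NULL;
}

/* Run worker() on a thread whose stack size is requested explicitly.
 * Returns 0 on success, -1 on failure. */
int run_with_stack(size_t stack_bytes, int *ok)
{
    pthread_attr_t attr;
    pthread_t tid;

    if (pthread_attr_init(&attr) != 0)
        return -1;
    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }

    int rc = pthread_create(&tid, &attr, worker, ok);
    pthread_attr_destroy(&attr);   /* safe once the thread has been created */
    if (rc != 0)
        return -1;

    return pthread_join(tid, NULL) == 0 ? 0 : -1;
}
```

For example, run_with_stack(1 << 20, &ok) runs the worker on a 1 MB stack. This only helps if you control the pthread_create call; for threads created inside a dependency, the process-wide default shown above is the quicker check.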
R.. GitHub STOP HELPING ICE

Add -Werror=implicit-function-declaration to your CFLAGS and you'll immediately have the cause. The key clue is the pointer value 0x557a1d20, which is almost surely the result of truncating a pointer to 32 bits. This happens when you fail to declare a function that returns a pointer: the compiler (by an awful backwards default) assumes it returns int rather than producing an error, and then allows the implicit conversion from int to pointer despite the C language disallowing it.
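To make the truncation concrete, here is a small sketch that imitates what happens to a pointer that passes through an implicitly declared function's int return value. It assumes a 64-bit target with 32-bit int; the helper name simulate_implicit_int_return is made up, and the "full" pointer value in the usage note is a hypothetical one chosen so that its low 32 bits match the value from the question.

```c
#include <stdint.h>

/* Hypothetical helper: imitates a pointer returned through an implicit
 * "int" return type on a 64-bit target with 32-bit int. */
void *simulate_implicit_int_return(void *p)
{
    /* Narrowing to int is implementation-defined for out-of-range values;
     * GCC and Clang keep the low 32 bits. */
    int as_int = (int)(intptr_t)p;

    /* Converting back sign-extends the int, so the upper 32 bits of the
     * original pointer are lost. */
    return (void *)(intptr_t)as_int;
}
```

With a hypothetical heap pointer of 0x5555557a1d20, the helper returns 0x557a1d20: a plausible-looking, non-null value that no longer points at the original object, which is why a printf of the pointer does not reveal the problem.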

R.. GitHub STOP HELPING ICE