My program gets a segmentation fault when I run it normally.

However, it works just fine if I run it under gdb. Moreover, the rate of segmentation faults increases when I increase the sleep time in the philo function. I am using Ubuntu 12.04 (Precise Pangolin). Here is my code:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

#include <signal.h>
#include <sys/wait.h>
#include <time.h>
#include <semaphore.h>
#include <errno.h>

#define STACKSIZE 10000
#define NUMPROCS 5
#define ROUNDS 10

int ph[NUMPROCS];

// cs[i] is the chopstick between philosopher i and i+1
sem_t cs[NUMPROCS], dead;

int philo() {
    int i = 0;
    int cpid = getpid();
    int phno;

    for (i=0; i<NUMPROCS; i++)
        if(ph[i] == cpid)
            phno = i;

    for (i=0; i < ROUNDS ; i++) {
        // Add your entry protocol here
        if (sem_wait(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_wait(&cs[(phno-1+NUMPROCS) % NUMPROCS]) != 0) {
            perror(NULL);
            return 1;
        }

        // Start of critical section -- simulation of slow n++
        int sleeptime = 20000 + rand()%50000;
        printf("philosopher %d is eating by chopsticks %d and %d\n", phno, phno, (phno - 1 + NUMPROCS)%NUMPROCS);
        usleep(sleeptime);
        // End of critical section

        // Add your exit protocol here
        if (sem_post(&dead) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[phno]) != 0) {
            perror(NULL);
            return 1;
        }
        if (sem_post(&cs[(phno - 1 + NUMPROCS) % NUMPROCS]) != 0) {
            perror(NULL);
            return 1;
        }
    }
    return 0;
}

int main(int argc, char ** argv) {
    int i;
    void* stack[NUMPROCS];
    srand(time(NULL));

    // Initialize semaphores
    for (i=0; i<NUMPROCS; i++) {
        if (sem_init(&cs[i], 1, 1) != 0) {
            perror(NULL);
            return 1;
        }
    }
    if (sem_init(&dead, 1, 4) != 0) {
        perror(NULL);
        return 1;
    }

    for (i = 0; i < NUMPROCS; i++) {
        stack[i] = malloc(STACKSIZE);
        if (stack[i] == NULL) {
            printf("Error allocating memory\n");
            exit(1);
        }

        // Create a child that shares the data segment
        ph[i] = clone(philo, stack[i] + STACKSIZE - 1, CLONE_VM|SIGCHLD, NULL);
        if (ph[i] < 0) {
            perror(NULL);
            return 1;
        }
    }

    for (i=0; i < NUMPROCS; i++)
        wait(NULL);
    for (i=0; i < NUMPROCS; i++)
        free(stack[i]);

    return 0;
}
Peter Mortensen
mihota
  • Does it still crash if you build the program with `-static`? I observed this crashing in the glibc dynamic loader's symbol resolution in the child process, during the first call through the PLT (i.e., `getpid()`) – scottt Mar 12 '13 at 19:59
  • Related: *[Crashes normally, but not with GDB](https://stackoverflow.com/questions/7507336/) (2011)* and *[Segmentation fault disappears when debugging with GDB](https://stackoverflow.com/questions/20496145/)* (2013) – Peter Mortensen Jul 28 '23 at 10:59

2 Answers

A typical Heisenbug: if you look at it, it disappears. In my experience, getting a segfault only outside gdb (or vice versa) is a sign of using uninitialized memory or a dependence on actual pointer addresses. Normally, running valgrind is ruthlessly accurate in detecting those. Unfortunately, (my) valgrind cannot handle your clone outside the pthread context.

Visual inspection suggests it is not a memory problem. Only the stacks are allocated on the heap, and their use looks OK. Except that you treat them through a void * pointer and then add something to it, which is not allowed in standard C (it is a GNU extension). The proper type would be char *, but the GNU extension does what you want.

Subtracting one from the top address of the stack is probably not necessary, and it might cause alignment errors on simple implementations of clone, but again I don't think that is the problem, as clone will most likely align the stack top again. Admittedly, the manual page of clone is not very clear about the exact location of the address: "topmost address of the memory space".

Just waiting for a state change of a child and assuming it died is a bit sloppy, and then taking away its stack might lead to segmentation faults, but again I don't think that is the problem, because you are probably not frantically sending signals to your philosophers.

If I run your application, the philosophers can finish their dinner undisturbed, both inside and outside gdb, so the following is a guess. Let's call the parent process that clones the philosophers "the table". Once a philosopher is cloned, the table stores the returned pid in ph; say it assigns that number to a chair. The first thing a philosopher does is look for his chair. If he doesn't find his chair, he is left with an uninitialized phno, which is then used to index his semaphores. Now this may very well lead to segmentation faults.

The implementation assumes that control returns to the table before the philosophers start. I can't find such a guarantee in the manual page, and I would actually expect it not to be true. Also, the clone interface offers a way to place the process id in memory shared between the child and the parent, suggesting this is a recognized problem (see the ptid and ctid parameters). If those are used, the pid will be written before either the table or the just-cloned philosopher gets control.

It is quite possible that this error explains the difference between running inside and outside gdb, because gdb is well aware of the processes spawned under its supervision and may treat them differently than the operating system does.

Alternatively, you could assign a semaphore to the table, so that nobody sits down at the table until the table says so, obviously only after it has assigned all the chairs. This would make much better use of the semaphore dead.

BTW, you are of course fully aware that the setup of your solution does allow for the situation where the philosophers each end up holding one fork (eh, chopstick) and starve to death waiting for the other. Luckily, the chances of that happening are very slim.

Bryan Olivier
ph[i] = clone(philo, stack[i] + STACKSIZE - 1, CLONE_VM|SIGCHLD, NULL);

This creates a thread of execution that glibc knows nothing about. As such, glibc does not create any of the thread-specific internal structures it needs for, e.g., dynamic symbol resolution.

With such a setup, calling into any glibc function from your philo function invokes undefined behavior, and you sometimes crash (because the dynamic loader will use the main thread's private data to perform symbol resolution, and because the loader assumes that each thread has its own private area, but you've violated this assumption by creating clones that share the single private area "behind glibc's back").

If you look at a core dump, there is a high chance that the actual crash happens in ld.so, which would confirm my guess.

Don't ever use clone directly (unless you know what you are doing). Use pthread_create instead.

Here is what I see in the core that I just got (which is exactly the problem I described):

Program terminated with signal 4, Illegal instruction.
#0  _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
239             vmovdqa %fs:RTLD_SAVESPACE_SSE+0*YMM_SIZE, %ymm0
(gdb) bt
#0  _dl_x86_64_restore_sse () at ../sysdeps/x86_64/dl-trampoline.S:239
#1  0x00007fb694e1dc45 in _dl_fixup (l=<optimized out>, reloc_arg=<optimized out>) at ../elf/dl-runtime.c:127
#2  0x00007fb694e0dee5 in _dl_runtime_resolve () at ../sysdeps/x86_64/dl-trampoline.S:42
#3  0x00000000004009ec in philo ()
#4  0x00007fb69486669d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
Employed Russian