I have a server written in C that blocks in accept() while it waits for incoming connections. When a connection is accepted, the server calls fork() to create a new process for it. I don't use epoll because each client socket is handled by an independent process, and one of the libraries it uses crashes in a multi-threaded environment.

Here is the code of server:

srv_sock = init_unix_socket();
listen(srv_sock, 5);
/* Other code which handles SIGCHLD. */
while (1) {
    log_info("Awaiting new incoming connection.");
    clt_sock = accept(srv_sock, NULL, NULL);
    if (clt_sock < 0) {
        log_err("Error ...");
        continue;
    }
    log_info("Connection %d accepted.", clt_sock);

    cld_pid = fork();
    if (cld_pid < 0) {
        log_err("Failed to create new process.");
        close(clt_sock);
        continue;
    }
    if (cld_pid == 0) {
        /* Initialize libraries. */
        /* Handle client connection ...  */
        shutdown(clt_sock, SHUT_RDWR);
        close(clt_sock);
        _exit(0);
    }
    else {
        log_info("Child process created for socket %d.", clt_sock);
        close(clt_sock);
    }
}

The client is written in Java. It connects to the server through the junixsocket library, since Java doesn't support Unix domain sockets. Once it is connected to the server, it sends a request (a header + XML document) and waits for the server's reply.

Here is the code of client:

File socketFile = new File(UNIX_SOCKET_PATH);
AFUNIXSocket socket = AFUNIXSocket.newInstance();
socket.connect(new AFUNIXSocketAddress(socketFile));

InputStream sis = socket.getInputStream();
OutputStream sos = socket.getOutputStream();
logger.info("Connected with server.");

byte[] requestHeader; /* filled in elsewhere */
byte[] requestBuffer; /* filled in elsewhere */

sos.write(requestHeader, 0, requestHeader.length);
logger.info("Header sent.");

sos.write(requestBuffer, 0, requestBuffer.length);
logger.info("Request XML sent.");

sos.flush();

The problem appears when 3 client threads connect to the server at the same time: only 1 task runs, while the other 2 wait until the first one has finished.

I have checked the logs. All 3 client threads connected and sent their requests to the server at (almost) the same time, but the server accepted only the first one to arrive and delayed the other 2. According to the logs, there is a delay of 3 minutes between connect on the client side and accept on the server side.

At first I thought the delay might be caused by some sort of buffering, so I called OutputStream.flush() after each OutputStream.write call, but the problem persists.

I cannot figure out what might cause this delay. Any ideas?

Thank you.

Update Mar 15 2016

pstack shows that the parent process was blocked in waitpid inside my SIGCHLD handler. This is probably why accept did not return when new connections arrived: execution of the main loop was interrupted by the signal handler.

Here is the code of my signal handler:

static void _zombie_reaper (int signum) {
    int status;
    pid_t child;

    if (signum != SIGCHLD) {
        return;
    }
    while ((child = waitpid(-1, &status, WNOHANG)) != -1) {
        continue;
    }
}

/* In main function */
struct sigaction sig_act;
memset(&sig_act, 0, sizeof(struct sigaction));
sigemptyset(&sig_act.sa_mask);
sig_act.sa_flags = SA_NOCLDSTOP;
sig_act.sa_handler = _zombie_reaper;
if (sigaction(SIGCHLD, &sig_act, NULL) < 0) {
    log_err("Failed to register signal handler.");
}
vesontio
  • And where is the code of the C server side? That'd be the first suspect for this kind of problem. It is quite hard to solve this problem without any code at all. – Antti Haapala -- Слава Україні Mar 15 '16 at 06:30
  • Sorry @AnttiHaapala, I've added the server code. – vesontio Mar 15 '16 at 06:42
  • Upvoted. So far I cannot see anything wrong with the server code :( How is the log output, is there a delay of 3 minutes between `Awaiting new incoming connection.` and `Connection accepted`? Perhaps it is on the client side then – Antti Haapala -- Слава Україні Mar 15 '16 at 06:52
  • Minutes is a long time. It looks like your server refuses the connection and the clients reconnect after a minute. Check the 'backlog' argument of 'listen' in your server and try a larger value. Let's try 64. – Marian Mar 15 '16 at 06:55
  • Not sure about Linux, but on a Mac, the man page for `fork` has this ominous warning: *"There are limits to what you can do in the child process. To be totally safe you should restrict yourself to only executing async-signal safe operations until such time as one of the exec functions is called. All APIs, including global data symbols, in any framework or library should be assumed to be unsafe after a fork() unless explicitly documented to be safe or async-signal safe."* – user3386109 Mar 15 '16 at 06:58
  • @user3386109 usually the problem is mixing `fork()`ing with threading; if you use someone else's frameworks you might have threads without knowing about it – Antti Haapala -- Слава Україні Mar 15 '16 at 07:01
  • @user3386109 may be right. Do you close 'srv_sock' socket in the child process? – Marian Mar 15 '16 at 07:02
  • Is there a reason you're using `fork`? If you care about performance, launching a separate thread (let alone process) per socket is going to make you cry when you scale up to 10000 connections (and thus 10000 threads/processes/whatever). Consider using non-blocking or asynchronous socket calls, or setting the socket options to non-blocking (which is different to using non-blocking or asynchronous socket calls). You should be able to achieve a reasonably performing server app with just one thread. If that doesn't perform well, you could scale this up well using `pthread_create`, **not** `fork`. – autistic Mar 15 '16 at 07:04
  • @Seb it is said in the question. Fork is needed for a linked library to function properly. "one of the libraries it uses crashes in multi-thread environment." – Antti Haapala -- Слава Україні Mar 15 '16 at 07:26
  • @vesontio which operating system are we at? – Antti Haapala -- Слава Україні Mar 15 '16 at 07:28
  • @AnttiHaapala Did you read the rest of my message? Why can't non-blocking sockets be used? You're not OP, so I don't expect you can answer this... but if you were OP and you responded like that, I would explain that you're [wasting our time with an XY problem](http://xyproblem.info/). Don't ask about your apparent solution for the crash. Ask about the crash. It could be signs of more sinister things to come, that `fork` can't solve. – autistic Mar 15 '16 at 08:19
  • Hi @AnttiHaapala, I've added the Java code of client. I've simply compared the logs of server and client. There is a delay of 3 min between the `Connection accepted` and `Connected with server`. Both programs run on the same machine, so they share the same system time. – vesontio Mar 15 '16 at 09:02
  • Hi @Marian, thank you for your suggestion, my code now uses a `backlog` of size 5, I will test with a larger value. And ... I never close the `srv_sock` in child process. I do have a signal handler which `wait` for terminated child process, and I have only used async-signal-safe functions in the signal handler. – vesontio Mar 15 '16 at 09:04
  • @user3386109 I thought the `async-signal-safe` thing is for signal handler. – vesontio Mar 15 '16 at 09:12
  • Hi @Seb, this is what I've done in my first version, `epoll` + `pthread` etc. and each task is handled by an independent thread. However the library I have to use suffers from occasional crash, when one task makes the library crash, the entire program is down, so are all the other parallel tasks. That's why I have to create this very multi-process version. – vesontio Mar 15 '16 at 09:18
  • @AnttiHaapala I'm using Centos 6.6. – vesontio Mar 15 '16 at 09:19
  • Try to close 'srv_sock' in the child process immediately after the fork. – Marian Mar 15 '16 at 10:47
  • @vesontio Why do you need each task to be handled by an independent thread? Threads are an optimisation... Do you have any evidence to suggest that a single-threaded version is too slow? The problem with your approach is you don't know how fast/slow a single-threaded version is, so you can't compare that to your parallelised version. You're guessing about optimisation, and most likely getting it wrong. Once you have a single-threaded version done, use a profiler to determine what the most significant bottleneck is, and work on optimising that (without threads, at first; they're last resort). – autistic Mar 15 '16 at 10:57
  • "Ohh, but multithreading is going to be faster" -- a very unfortunate and misinformed, but typical response to my line of questioning... It makes sense that a CPU that has 8 cores can only execute 8 things simultaneously; if you fire up more than 8 threads you might be wasting context switches. Are you aware that [the C10K problem](http://www.kegel.com/c10k.html) (10000 FTP clients on one server) was solved using a single thread, all the way back in the year of 1999? Surely with this understanding you might realise why I'm curious you haven't tried a single-threaded version... – autistic Mar 15 '16 at 11:10
  • The first thing your child process should do is `close(srv_sock)`, as there's no reason for the child process to use that socket. This can also help expose other bugs you might have, e.g. a bug that causes the child process to start using the wrong file descriptor. Do the same with any other file descriptors that the child process should not use: close them. – nos Mar 15 '16 at 11:45
  • @nos, thanks, I'll do it. – vesontio Mar 15 '16 at 11:47
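
The advice in the comments above (close `srv_sock`, and any other descriptor the child should not use, right after `fork()`) could be sketched roughly as follows. This is only an illustration, not the asker's actual server: `spawn_handler`, `demo_spawn_handler`, the one-byte echo that stands in for "Handle client connection", and the use of `socketpair()` in place of a real `accept()` are all assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that serves clt_sock.
 * The child first closes every descriptor listed in unused_fds
 * (for example the listening socket). Returns the child's pid in
 * the parent, or -1 if fork() failed. */
pid_t spawn_handler(int clt_sock, const int *unused_fds, size_t n_unused)
{
    size_t i;
    pid_t pid;

    assert(unused_fds != NULL || n_unused == 0);
    pid = fork();
    if (pid == 0) {
        char byte;

        for (i = 0; i < n_unused; i++)
            close(unused_fds[i]);
        /* Stand-in for the real client handling: echo one byte. */
        if (read(clt_sock, &byte, 1) != 1 || write(clt_sock, &byte, 1) != 1)
            _exit(1);
        close(clt_sock);
        _exit(0);
    }
    /* Parent, or fork() failure: the parent never talks to the client. */
    close(clt_sock);
    return pid;
}

/* Hypothetical demo: returns the byte echoed by the child, or -1 on error. */
int demo_spawn_handler(void)
{
    int pair[2];
    int unused[2];
    int srv_sock;
    char byte = 'x';
    char reply = 0;
    pid_t pid;

    /* Stand-in for the listening socket; it is never bound here. */
    srv_sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv_sock < 0)
        return -1;
    /* pair[1] plays the accepted socket, pair[0] the client's end. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair) < 0) {
        close(srv_sock);
        return -1;
    }
    unused[0] = srv_sock;
    unused[1] = pair[0];
    pid = spawn_handler(pair[1], unused, 2);
    if (pid > 0) {
        if (write(pair[0], &byte, 1) != 1 || read(pair[0], &reply, 1) != 1)
            reply = 0;
        waitpid(pid, NULL, 0);
    }
    close(pair[0]);
    close(srv_sock);
    return pid > 0 ? reply : -1;
}
```

In the server's loop, the call would look something like `spawn_handler(clt_sock, &srv_sock, 1)`. Passing the descriptors as a list keeps the "close what you don't need" step in one place.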

1 Answer

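Your waitpid() condition is wrong. With WNOHANG, waitpid() returns 0 when child processes exist but none of them has exited yet, so a loop that tests != -1 keeps spinning inside the signal handler until every child has finished, and accept() cannot run in the meantime. You only want to keep calling waitpid() while it actually collects a child process, so you need:

while ((child = waitpid(-1, &status, WNOHANG)) > 0) {
    continue;
}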
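
As a rough illustration of why the `> 0` test matters, here is a minimal sketch that reaps several short-lived children while one long-lived child is still running. The names `reap_exited_children`, `zombie_reaper` and `demo_reap` are made up for this example, and saving and restoring `errno` in the handler is an extra precaution that is not in the original code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Reap every child that has already exited, without blocking.
 * Returns the number of children collected by this call. */
int reap_exited_children(void)
{
    int reaped = 0;
    int status;

    /* With WNOHANG, waitpid() returns:
     *   > 0  a child was collected, so try again;
     *   0    children exist but none has exited yet, so stop;
     *   -1   error (ECHILD when there are no children), so stop. */
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* SIGCHLD handler in the style of the question's _zombie_reaper. */
void zombie_reaper(int signum)
{
    int saved_errno = errno; /* waitpid() may overwrite errno */

    (void)signum;
    reap_exited_children();
    errno = saved_errno;
}

/* Hypothetical demo: start one long-lived child and n_short children
 * that exit at once, then poll until the short-lived ones are reaped.
 * With a "!= -1" loop condition, the first poll would spin for as
 * long as the long-lived child stays alive. */
int demo_reap(int n_short)
{
    struct timespec nap = {0, 10 * 1000 * 1000}; /* 10 ms */
    int started = 0, total = 0, i;
    pid_t long_lived;

    assert(n_short >= 0);
    long_lived = fork();
    if (long_lived < 0)
        return -1;
    if (long_lived == 0) {
        pause(); /* stays alive until it is killed */
        _exit(0);
    }
    for (i = 0; i < n_short; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        if (pid > 0)
            started++;
    }
    while (total < started) {
        total += reap_exited_children();
        nanosleep(&nap, NULL);
    }
    kill(long_lived, SIGKILL);
    waitpid(long_lived, NULL, 0);
    return total;
}
```

The demo calls `reap_exited_children()` directly instead of installing `zombie_reaper`, which keeps it deterministic. In the server, `zombie_reaper` would be registered with `sigaction()` the same way the question already does.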
nos