
I have a long-running Linux daemon that creates a non-blocking socket bound to port 65445 and waits for UDP packets. It works most of the time.

Now I have hit an issue where the process enters the "D" state (uninterruptible sleep) after a while. I am not sure which message triggers it, but the daemon processes such messages correctly most of the time and fails only at some random point.

At this stage the process does not respond to any signal, so I can't kill it. Here is a dump of its kernel stack:

Kernel Status:
[<ffffffff80385ff3>] number.isra.2+0x2d3/0x300
[<ffffffff802e0228>] address_space_init_once+0x88/0x120
[<ffffffff802e0200>] address_space_init_once+0x60/0x120
[<ffffffff802df880>] inode_wait+0x0/0x10
[<ffffffff802df889>] inode_wait+0x9/0x10
[<ffffffff802df880>] inode_wait+0x0/0x10
[<ffffffff80275dc0>] wake_bit_function+0x0/0x30
[<ffffffff802e0abd>] iget_locked+0x11d/0x180
[<ffffffff8030e1f0>] proc_get_inode+0x10/0xf0
[<ffffffff80313555>] proc_lookup_de+0x75/0xf0
[<ffffffff802d31bc>] d_alloc_and_lookup+0x3c/0x90
[<ffffffff802deeee>] d_lookup+0x2e/0x60
[<ffffffff802d3ea6>] do_lookup+0x296/0x3a0
[<ffffffff802ddcee>] dput+0x1e/0x190
[<ffffffff802d49eb>] link_path_walk+0x12b/0x850
[<ffffffff802dddb2>] dput+0xe2/0x190
[<ffffffff802d7437>] path_openat+0xb7/0x370
[<ffffffff802b4cfe>] tlb_finish_mmu+0xe/0x50
[<ffffffff802d7824>] do_filp_open+0x44/0xb0
[<ffffffff802e2905>] alloc_fd+0x45/0x130
[<ffffffff802c8a5c>] do_sys_open+0xec/0x1d0
[<ffffffff805a8afb>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

This indicates that something is wrong in procfs. After further investigation, I found that the net directory of this process's procfs entry was corrupted: I can't even run ls /proc/*pid*/net, because bash hangs there as well.

I have narrowed it down to the process possibly hanging in 'recvfrom', which I can't understand because it is a non-blocking socket. Part of my code follows:

fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
if (fd < 0) {
    return ret;
}

if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, DEV, sizeof(DEV))
    < 0) {
    goto Exit;
}

rt = setsockopt(fd, SOL_SOCKET, SO_SNDBUF, (char *)&sock_buf, sizeof(sock_buf));
if (rt < 0) {
    goto Exit;
}

rt = setsockopt(fd, SOL_SOCKET, SO_RCVBUF, (char *)&sock_buf, sizeof(sock_buf));
if (rt < 0) {
    goto Exit;
}

val = fcntl(fd, F_GETFL, 0);
if (val < 0) 
    return -1;
if (val & O_NONBLOCK)
    return 0;
val |= O_NONBLOCK;
fcntl(fd, F_SETFL, val);

memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_port = PORT;   /* must already be in network byte order, e.g. htons(65445) */
addr.sin_addr.s_addr = INADDR_ANY;

if (bind(fd, (void *)&addr, sizeof(addr)) < 0) {
    goto Exit;
}
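
For comparison, here is a minimal sketch of the same setup written as two small helpers. The names `set_nonblocking` and `open_udp_socket` are hypothetical and not from the daemon. The sketch assumes the port is given in host byte order and converts it with `htons`, closes the descriptor on every failure path, and leaves out `SO_BINDTODEVICE` and the buffer-size options for brevity. The `return` statements in the fcntl part of the snippet above look like they were pasted from a helper of this kind; if they were inline, they would skip `bind` entirely.

```c
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set O_NONBLOCK on fd. Returns 0 on success (including when the flag
 * was already set), -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: create a non-blocking UDP socket bound to
 * INADDR_ANY on the given port (host byte order).
 * Returns the descriptor, or -1 on error. */
static int open_udp_socket(uint16_t port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    if (fd < 0)
        return -1;
    if (set_nonblocking(fd) < 0)
        goto fail;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);              /* network byte order */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;
    return fd;

fail:
    close(fd);                                /* do not leak the socket */
    return -1;
}
```

Calling `open_udp_socket(65445)` would then return a descriptor ready to be registered with epoll.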

I add this socket to epoll:

struct epoll_event e;
e.events = EPOLLIN;
e.data.ptr = comm_handle;
rc = epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &e);

and receive the packet when there is an event:

socklen_t from_len = sizeof(struct sockaddr_in);
memset(recv_buf, 0, 65000);
len = recvfrom(e->fd, recv_buf, 65000, 0, (struct sockaddr *)&from, &from_len);
if (len <= 0) {
    return -1;
}
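
As a point of reference, a non-blocking `recvfrom` never sleeps waiting for data: when the queue is empty it fails with `EAGAIN`/`EWOULDBLOCK`. The sketch below is an illustration rather than the daemon's actual code. It shows one common way to drain a socket after an `EPOLLIN` event while telling "nothing left to read" apart from a real error; `drain_udp` is a hypothetical name.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read every datagram currently queued on a non-blocking socket.
 * Returns the number of datagrams read, or -1 on a real error. */
static int drain_udp(int fd, char *buf, size_t buf_len)
{
    int count = 0;

    for (;;) {
        struct sockaddr_in from;
        socklen_t from_len = sizeof(from);
        ssize_t len = recvfrom(fd, buf, buf_len, 0,
                               (struct sockaddr *)&from, &from_len);

        if (len >= 0) {
            /* A zero-length UDP datagram is valid, so it counts too. */
            count++;
            continue;
        }
        if (errno == EINTR)
            continue;                 /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return count;             /* queue is empty: not an error */
        return -1;                    /* genuine failure */
    }
}
```

If a loop like this ever blocks, the descriptor passed in is most likely not the non-blocking socket it is assumed to be.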
Gemini
  • The kernel trace you show indicates you've called open() - is there any place your code opens a file? I'd also suggest running your program through valgrind, to see if there's any memory corruption going on. Also, if your call is stuck in recvfrom, try to figure out the value of `e->fd` - e.g. things might have been mixed up, so `e->fd` is something it should not be (meaning you're stuck trying to receive from fd 0 (stdin, unless it has been closed), or another descriptor that refers to a socket for DNS lookup or similar). – nos Dec 06 '18 at 19:24
  • That said, if the system hangs at /proc/*pid*/net , your system might just be broken - look for issues in /var/log/messages or other places kernel logging might be placed on your system – nos Dec 06 '18 at 19:28
  • I didn't call open() in my program, only socket-related calls. Is it possible that some socket syscall like "recvfrom" will trigger opening some proc file in "/proc/*pid*/net"? I suspect that the kernel is accessing "/proc/*pid*/net" and hangs there, since, as I mentioned, bash also hangs when I manually access "/proc/*pid*/net". Unfortunately, it's a customized Linux kernel (3.2.16) and I don't have the kernel log. – Gemini Dec 06 '18 at 22:25
  • There are a few socket functions such as if_index, getifaddrs, getaddrinfo, gethostbyname, and possibly a few others that will open /proc/net/. None of the basic socket/bind/accept/recvfrom/setsockopt/sendto calls does. At any rate, if stuff in /proc/net/ hangs, it's not due to your user space code; something is severely wrong with your kernel/drivers or hardware. – nos Dec 06 '18 at 22:42
  • Just want to clarify that "/proc/net" is not corrupted; only my process's (in this case, pid 424) "/proc/424/net" is corrupted. "/proc/net" and all other processes' procfs entries seem fine. Only a reboot recovers it, and it takes several hours to reproduce. – Gemini Dec 06 '18 at 22:59
  • Regardless, user space code should not be able to make that happen. – nos Dec 06 '18 at 23:00

0 Answers