
I'm having a lot of trouble sending netlink messages from a kernel module to a userspace daemon. They fail at random. On the kernel side, genlmsg_unicast fails with EAGAIN, while on the user side, nl_recvmsgs_default (a libnl function) fails with NLE_NOMEM, which is caused by the recvmsg syscall failing with ENOBUFS.

The netlink messages are small; the maximum payload size is about 300 bytes.

Here is the code for sending a message from the kernel:

int send_to_daemon(void* msg, int len, int command, int seq, u32 pid) {
    struct sk_buff* skb;
    void* msg_head;
    int res, payload;

    payload = GENL_HDRLEN+nla_total_size(len)+36;
    skb = genlmsg_new(payload, GFP_KERNEL);
    msg_head = genlmsg_put(skb, pid, seq, &psvfs_gnl_family, 0, command);
    nla_put(skb, PSVFS_A_MSG, len, msg);
    genlmsg_end(skb, msg_head);
    genlmsg_unicast(&init_net, skb, pid);

    return 0;
}

I absolutely have no idea why this is happening and my project just won't work because of that! I really hope someone could help me with that.

Marco Bonelli
ghik
  • Why are you not checking the return values of any of the genlmsg_* functions? That should be your first step in identifying which function is causing the problem. – Harman Dec 13 '11 at 08:11
  • I do check the values. `genlmsg_unicast` returns `-EAGAIN`, as I described above, while all other functions succeed. I just removed the checks from the code above to make it shorter and show only the logic. – ghik Dec 13 '11 at 08:23

2 Answers


I was having a similar problem receiving ENOBUFS via recvmsg from a netlink socket. I found that my problem was the kernel socket buffer filling before userspace could drain it.

From the netlink(7) man page:

   However, reliable transmissions from kernel to user are impossible in
   any case.  The kernel can't send a  netlink  message  if  the  socket
   buffer  is  full:  the message will be dropped and the kernel and the
   user-space process will no longer have the same view of kernel state.
   It  is  up  to  the  application to detect when this happens (via the
   ENOBUFS error returned by recvmsg(2)) and resynchronize.
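The detection step described in that passage can be sketched roughly as follows. This is only an illustration using a plain recv() on the socket's file descriptor; the names rx_status and recv_one are made up for this example, and what "resynchronize" means depends entirely on the application's protocol.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Possible outcomes of one receive attempt (hypothetical names). */
enum rx_status { RX_OK, RX_OVERRUN, RX_ERROR };

/*
 * Read one message from fd into buf.
 * RX_OVERRUN means recv() failed with ENOBUFS: the kernel dropped at least
 * one message because the receive buffer was full, so the caller has to
 * resynchronize its state with the kernel side.
 */
static enum rx_status recv_one(int fd, void *buf, size_t cap, ssize_t *len)
{
    for (;;) {
        *len = recv(fd, buf, cap, 0);
        if (*len >= 0)
            return RX_OK;
        if (errno == EINTR)
            continue;           /* interrupted by a signal, try again */
        if (errno == ENOBUFS)
            return RX_OVERRUN;  /* messages were lost */
        return RX_ERROR;        /* any other failure */
    }
}
```

The socket remains usable after ENOBUFS, so a caller would typically treat RX_OVERRUN as a signal to re-request state and then keep reading.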

I addressed this problem by increasing the size of the socket receive buffer (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, ...), or nl_socket_set_buffer_size() if you are using libnl).
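For the plain setsockopt() route, a minimal sketch might look like this. The helper name grow_rcvbuf is hypothetical, and the right size depends on how bursty the kernel side is. On Linux the kernel doubles the requested value and caps it at net.core.rmem_max, so the helper reads the effective size back rather than assuming the request was honoured.

```c
#include <sys/socket.h>

/*
 * Ask for a receive buffer of `bytes` on fd and return the size the kernel
 * actually applied, or -1 on error. Hypothetical helper for illustration.
 */
static int grow_rcvbuf(int fd, int bytes)
{
    int effective = 0;
    socklen_t optlen = sizeof(effective);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    /* Read the value back: Linux doubles it and enforces rmem_max. */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &optlen) < 0)
        return -1;
    return effective;
}
```

A larger buffer only makes overruns less likely; the receiver still has to handle ENOBUFS if the daemon can fall behind.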

Dave

I wonder if you are running on a 64-bit machine. If so, I suspect that using an int as the type of payload could be the root of some issues, as genlmsg_new() expects a size_t, which is 64 bits wide on x86_64.

Secondly, I don't think you need to add GENL_HDRLEN to payload, as genlmsg_new() already takes care of this (it calls genlmsg_total_size(), which calls genlmsg_msg_size(), which finally does the addition). And why the + 36? It does not look very portable, nor is it explicit about what it is there for.
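To make the size arithmetic concrete, here is a userspace sketch that mirrors those kernel helpers using the macros from the uapi headers. The function names attr_total_size and genl_skb_size are made up for this illustration, and it assumes the family defines no extra user header (hdrsize of 0), which the question does not show.

```c
#include <stddef.h>
#include <linux/netlink.h>
#include <linux/genetlink.h>

/* Mirror of nla_total_size(): attribute header plus payload, padded to 4 bytes. */
static size_t attr_total_size(size_t payload)
{
    return NLA_ALIGN(NLA_HDRLEN + payload);
}

/*
 * Mirror of the room genlmsg_new(payload, ...) ends up asking for:
 * genlmsg_total_size() adds GENL_HDRLEN, then nlmsg_new() adds NLMSG_HDRLEN.
 * `payload` is therefore the size of the attributes only.
 */
static size_t genl_skb_size(size_t payload)
{
    size_t with_genl = NLMSG_ALIGN(GENL_HDRLEN + payload);
    return NLMSG_ALIGN(NLMSG_HDRLEN + with_genl);
}
```

Under these assumptions a single 300-byte attribute needs attr_total_size(300) = 304 bytes, and passing that value to genlmsg_new() would reserve room for a 324-byte message including both headers.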

Hard to tell more without having a look at the rest of the code.

Quentin Casasnovas
  • Thanks for your interest :) I am running it on a 32-bit machine. Either way, I don't think this is the cause, because my messages are very small (at most, let's say, 1KB). The payload computation in my code is the result of my not being able to figure out what exactly I should pass to `genlmsg_new`. If I don't add these 36 bytes, `nla_put` will fail. I know this is very ugly. I will show some more code in my post. The whole code is here, if you are interested: https://github.com/ghik/PS (it's a bit ugly, but it's not going to be maintained or extended, so it doesn't really matter). – ghik Dec 13 '11 at 23:26