
I'm trying to implement kernel-user communication using netlink sockets and the generic message type. So far I'm able to send messages from user space to the kernel and then send a message back to user space. The problem is that my user space program always reports that an invalid/malformed message was received. In the user space program I'm using libnl for the netlink communication.

The relevant netlink kernel code looks like the following:

enum nl_tdisk_attr {
    NL_UNSPEC,
    NL_MY_ATTR,    //My argument
    __NL_MAX
};
#define NL_MAX (__NL_MAX - 1)

enum nl_tdisk_msg_types {
    NL_CMD_READ = 0,
    NL_CMD_MY_CMD,   //My command
    NL_CMD_MAX
};

//Family definition
static struct genl_family family = {
    .id = GENL_ID_GENERATE,
    .name = "my-family",
    .hdrsize = 0,
    .version = 0,
    .maxattr = NL_MAX,
};

//Command definition
static struct genl_ops ops[] = {
    {
        .cmd = NL_CMD_MY_CMD,
        .doit = genl_register,
    }
};

//...
//When the module is loaded:
genl_register_family_with_ops(&family, ops);


//Now some data should be sent to user space:
struct sk_buff *msg= nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
void *hdr = genlmsg_put(msg, port/*note1*/, 0, &family, 0/*note2*/, NL_CMD_MY_CMD);
nla_put_u32(msg, NL_MY_ATTR, some_value);
genlmsg_end(msg, hdr);
genlmsg_unicast(&init_net, msg, port/*note1*/); //note3

Please note that I removed error checking to reduce the amount of code.

Some notes:

  • note1: The port of the user space program is stored internally in the kernel module - I'm 100% sure it is correct
  • note2: In the flags I also tried to set NLM_F_REQUEST but without success
  • note3: The function genlmsg_unicast always returns 0, which means that the message was sent successfully. So I assume the kernel code should be fine.

And here the user space code:

#include <netlink/netlink.h>
#include <netlink/socket.h>
#include <netlink/types.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>
#include <netlink/genl/mngt.h>

//...
struct nl_sock *socket = nl_socket_alloc();

//I explicitly set those callbacks to get some debug information
nl_socket_modify_cb(socket, NL_CB_MSG_IN, NL_CB_DEBUG, NULL, NULL);
nl_socket_modify_cb(socket, NL_CB_INVALID, NL_CB_DEBUG, NULL, NULL);

//I also tried to play around with the buffer size:
nl_socket_set_buffer_size(socket, 65536, 65536);

genl_connect(socket);
familyId = genl_ctrl_resolve(socket, "my-family");    //This works and gives me the correct Family id

nl_recvmsgs_default(socket);

As soon as the kernel sends a message I see debug information in the user space program, but sadly it's just error messages:

-- Debug: Received Message:
--------------------------   BEGIN NETLINK MESSAGE ---------------------------
  [NETLINK HEADER] 16 octets
    .nlmsg_len = 308
    .type = 23 <0x17>
    .flags = 0
    .seq = 0
    .port = -1765782228
  [GENERIC NETLINK HEADER] 4 octets
    .cmd = 1
    .version = 1
    .unused = 0
  [PAYLOAD] 4 octets
    08 00 02 00                                     ....
---------------------------  END NETLINK MESSAGE   ---------------------------
-- Error: Invalid message: type=0x17 length=24 flags=0 sequence-nr=0 pid=2529185068

As you can see, after the line "END NETLINK MESSAGE" there is the message from the NL_CB_INVALID callback, which tells me that an invalid message was received.

So the communication itself works as it should; the program just receives a message it considers invalid, and I don't know why. Does anybody know where I can look for more information about why the message is malformed? Or even better: does anyone see an error in my code? Or does anyone know a really good website that describes such a scenario?

Thomas Sparber

2 Answers


After a long time of trial and error I finally found some kind of solution. The problem was actually that I had modified the "invalid message" callback: nl_socket_modify_cb(socket, NL_CB_INVALID, NL_CB_DEBUG, NULL, NULL);

With that callback modified, nl_recvmsgs_default(socket); always returned 0, meaning there was no error. After removing the callback, I realized that nl_recvmsgs_default(socket); returned -16 which, according to the documentation, means "Message sequence number mismatch". For some reason it doesn't accept sequence number 0, I don't know why...

To solve the problem, I added nl_socket_disable_seq_check(socket); in the user space program. I guess it's not an optimal solution, so if you know a better solution please let me know!

Thomas Sparber

(BTW: This answer doesn't make sense if you haven't read @ThomasSparber's own answer first, which identifies the root of the problem and a workaround.)

You can specify the sequence number during genlmsg_put. libnl expects the response seqnum to be the same as the request's.

Assuming you're calling genlmsg_put during genl_register:

int genl_register(struct sk_buff *skb, struct genl_info *info)
{
    ...
    genlmsg_put(msg, port, info->nlhdr->nlmsg_seq, &family, 0,
            NL_CMD_MY_CMD);
    ...
}

That should do it. Disabling sequence number checking is probably a bad idea, since you might mix up requests and responses in multithreaded userspace clients and the like.


By the way, this is also probably bad:

struct sk_buff *msg= nlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);

NLMSG_GOODSIZE is not a good size for nlmsg_new; it's a good size for the whole packet. The whole packet is whatever you pass to nlmsg_new plus at least the netlink header size, and you don't want it to exceed PAGE_SIZE. NLMSG_DEFAULT_SIZE is generally a better candidate for nlmsg_new.

BUT, since you're using Generic Netlink, you probably want to scratch that altogether and do

struct sk_buff *msg= genlmsg_new(GENLMSG_DEFAULT_SIZE, GFP_KERNEL);

(Unfortunately, GENLMSG_DEFAULT_SIZE is not available in somewhat older kernels.)

Yd Ahhrk
  • Very nice answer - thanks! `nlmsg_new` sounds very good - I will try that! I know that I can set the sequence number when I'm "answering" a request. My problem is that the kernel is sending a request to user space, so I don't have `info->nlhdr->nlmsg_seq` available. – Thomas Sparber Mar 02 '16 at 06:53
  • @Thomas The ["expected" sequence number](https://github.com/tgraf/libnl/blob/dcc537597728c84d47fe9aff32b982c72055a1ad/lib/nl.c#L833) is initialized as something [that is not likely zero](https://github.com/tgraf/libnl/blob/dcc537597728c84d47fe9aff32b982c72055a1ad/lib/socket.c#L193). If the kernel is the one making the request, sequence number checking done by userspace really doesn't make sense in your code, at least for the first packet. Your own answer (disabling sequence checking) is correct. – Yd Ahhrk Mar 30 '16 at 17:06