I am using the code from the question "numa+mbind+segfault", and every call to mbind returns EINVAL. How can I find out exactly what is wrong? I am asking because EINVAL can be returned for many reasons.

page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size/sizeof(A[0]);
assert(page_size%sizeof(A[0])==0);
split_three=num_items/3;
aligned_size=(split_three/objs_per_page)*objs_per_page;
remnant=num_items-(aligned_size*3);
piece = aligned_size;

nodemask=1;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

nodemask=2;
mbind(&A[aligned_size],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);

nodemask=4;
mbind(&A[aligned_size*2+remnant],piece*sizeof(double),MPOL_BIND,
     &nodemask,64,MPOL_MF_MOVE);

When I run the program from Mats Petersson's answer below, with the nodemask set to 1, 2 and 4 before the respective mbind calls, it sometimes segfaults and sometimes runs fine. When it segfaults, dmesg shows the following:

Stack:
Call Trace:
mpol_new+0x5d/0xb0
sys_mbind+0x125/0x4f0
finish_task_switch+0x4a/0xf0
? __schedule+0x3cf/0x7c0
system_call_fastpath+0x16/0x1b
Code: ...
kmem_cache_alloc+0x58/0x130
tiki
  • Looks like a proper kernel crash to me. Not sure why - what exact kernel are you running? Not sure this is an easy one to fix, I'm afraid. Your system is generally stable and running well, yes? – Mats Petersson Jan 27 '13 at 18:35
  • @MatsPetersson This is Ubuntu 12.10. Linux 3.5.0-19-generic #30, x86_64. Thanks. – tiki Jan 27 '13 at 18:50
  • It does look like the relevant code in 3.5 (http://lxr.linux.no/#linux+v3.5/mm/slub.c#L2305) and 3.7.4 (http://lxr.linux.no/#linux+v3.7.4/mm/slub.c#L2317) has changed somewhat, but not significantly. Of course, any bug could be in the few hundred lines of code before the call to kmem_cache_alloc too. I can't really see where this would go wrong tho'. – Mats Petersson Jan 27 '13 at 19:13

1 Answer

Looking at the Linux kernel source, you can get EINVAL for:

  • Passing in an invalid mode value: either out of range or "inconsistent" (using both static and relative nodes at the same time).
  • An invalid maxnode (greater than the number of bits in a page, which is 32K on x86).
  • Various other problems with nodemask.
  • Passing flags other than MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL.
  • start is not page-aligned.
  • start+len when page-aligned = start. [that is, your len is not at least one byte]
  • start+len < start - that is, negative length.
  • policy = MPOL_DEFAULT and nodes isn't empty or NULL.
  • Quoting a comment from the source: "MPOL_PREFERRED cannot be used with MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES if the nodemask is empty (local allocation). All other modes require a valid pointer to a non-empty nodemask."

My guess would be that start is not page-aligned.
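
Some of the conditions in the list can be checked in user space before calling mbind, which helps narrow down which one applies. The following is only a rough sketch: `check_mbind_args` and its result codes are hypothetical names invented here, it covers just the alignment, wrap-around and maxnode checks, and it does not reproduce what the kernel does in full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes, made up for this sketch. */
enum mbind_arg_problem {
    MBIND_ARGS_OK = 0,
    MBIND_START_UNALIGNED,
    MBIND_LEN_WRAPS,
    MBIND_MAXNODE_TOO_BIG
};

/* Hypothetical helper: checks a few mbind arguments in user space. */
static enum mbind_arg_problem check_mbind_args(uintptr_t start, size_t len,
                                               unsigned long maxnode,
                                               size_t page_size)
{
    /* page_size must be a non-zero power of two for the masks below. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (start & (page_size - 1))
        return MBIND_START_UNALIGNED;

    /* Round len up to whole pages, then look for address wrap-around. */
    size_t rounded = (len + page_size - 1) & ~(page_size - 1);
    if (rounded < len || (uintptr_t)(start + rounded) < start)
        return MBIND_LEN_WRAPS;

    /* The limit quoted in the list: no more bits than fit in one page. */
    if (maxnode > page_size * 8)
        return MBIND_MAXNODE_TOO_BIG;

    return MBIND_ARGS_OK;
}
```

It could be called as `check_mbind_args((uintptr_t)&A[aligned_size], piece*sizeof(double), 64, sysconf(_SC_PAGESIZE))` just before the matching mbind call. Nodemask problems are not covered, since they depend on which nodes the machine actually has.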

This code works for me:

#include <numaif.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

#define ASSERT(x) do { if (!(x)) do_assert(#x,(long)(x), __FILE__, __LINE__); } while(0)

static void do_assert(const char *expr, long expr_int, const char *file, int line)
{
    /* expr_int is a long, so it needs %ld rather than %d. */
    fprintf(stderr, "ASSERT failed %s (%ld) at %s:%d\n",
        expr, expr_int, file, line);
    perror("Error if present");
    exit(1);
}


int main()
{ 
    size_t num_items = 6156000;
    double *A = valloc(num_items * sizeof(double));
    ASSERT(A != NULL);
    int res;
    unsigned long nodemask;


    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t objs_per_page = page_size/sizeof(A[0]);
    ASSERT(page_size%sizeof(A[0])==0);
    size_t split_three=num_items/3;
    size_t aligned_size=(split_three/objs_per_page)*objs_per_page;
    size_t remnant=num_items-(aligned_size*3);
    size_t piece = aligned_size;

    /* %p expects a void pointer and size_t values need %zu. */
    printf("A[0]=%p\n", (void *)&A[0]);
    printf("A[%zu]=%p\n", piece, (void *)&A[aligned_size]);
    printf("A[%zu]=%p\n", 2*piece, (void *)&A[2*piece]);


    nodemask=1;
    res = mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
    ASSERT(res ==0);
    nodemask=1;
    res = mbind(&A[aligned_size],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
    ASSERT(res ==0);

    nodemask=1;
    res = mbind(&A[aligned_size*2],(piece+remnant)*sizeof(double),MPOL_BIND,
     &nodemask,64,MPOL_MF_MOVE);
    ASSERT(res == 0);
}

Note that I'm using nodemask=1 for all three calls: my machine has a single quad-core processor, so there are no other nodes to bind to, and asking for a node that does not exist also gives EINVAL. I take it you actually have more than one node in your system.
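
The nodemask is a bit mask, so the values 1, 2 and 4 select nodes 0, 1 and 2. A small sketch of building such a mask is shown here; `nodemask_for_node` is a hypothetical helper, and `num_nodes` is assumed to be known already (for example, read off by hand from `numactl --hardware`).

```c
#include <limits.h>

/* Hypothetical helper: sets *mask to the single-node mask for `node`.
 * Returns 0 on success, or -1 if the node does not exist on a machine
 * with num_nodes nodes or does not fit in one unsigned long. */
static int nodemask_for_node(unsigned node, unsigned num_nodes,
                             unsigned long *mask)
{
    if (node >= num_nodes || node >= sizeof(unsigned long) * CHAR_BIT)
        return -1;
    *mask = 1UL << node;  /* node 0 -> 1, node 1 -> 2, node 2 -> 4 */
    return 0;
}
```

Rejecting the node up front makes it clear when an EINVAL would come from the mask rather than from the address or length.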

I also moved "remnant" out of the A[] index and into the length of the last mbind call (piece+remnant), so the third start address stays on a page boundary.
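
The split arithmetic can be written as a small function to make the alignment easier to check. This is a sketch only: `struct split3` and `split_three_aligned` are hypothetical names, and it assumes the array base is itself page-aligned and that the page size is a multiple of the element size.

```c
#include <stddef.h>

/* Hypothetical struct for illustration: element offsets and element
 * counts of three pieces of an array. */
struct split3 {
    size_t start[3];  /* index of the first element of each piece */
    size_t count[3];  /* number of elements in each piece */
};

/* Splits num_items elements into three pieces whose starts fall on
 * page boundaries. The leftover elements go into the last piece's
 * length, not into its start index. */
static struct split3 split_three_aligned(size_t num_items, size_t elem_size,
                                         size_t page_size)
{
    size_t objs_per_page = page_size / elem_size;
    size_t aligned = (num_items / 3 / objs_per_page) * objs_per_page;
    size_t remnant = num_items - 3 * aligned;
    struct split3 s = {
        { 0, aligned, 2 * aligned },
        { aligned, aligned, aligned + remnant }
    };
    return s;
}
```

For 6156000 doubles and 4096-byte pages this gives pieces starting at elements 0, 2051584 and 4103168, with 1248 leftover elements added to the last length. Adding those 1248 elements to the third start index would move it off a page boundary, because 1248 is not a multiple of the 512 doubles that fit in one page.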

Mats Petersson
  • I was running it for SIZE of 6156000, I get piece=2051584, aligned_size=2051584, and remnant=1248. Thanks. – tiki Jan 27 '13 at 14:56
  • So, let's calculate this: 6156000 bytes is 769500 doubles. Each group should have 256500 doubles. But that's not a whole number of pages. 256000 is. That's 2048000 bytes. Which means the second start will not be aligned when using the "aligned_size". Can you paste the calculations to get aligned_size and remnant in your question, please. – Mats Petersson Jan 27 '13 at 15:11
  • "split_three=num_items/3; aligned_size=(split_three/objs_per_page)*objs_per_page; remnant=num_items-(aligned_size*3); piece=aligned_size;" Thanks. – tiki Jan 27 '13 at 15:21
  • Please post that as part of your QUESTION! And include "objs_per_page" calculation as well. – Mats Petersson Jan 27 '13 at 15:26
  • Thanks. I did compile and run the code above. I changed nodemask to 2 and 4 for second and third mbind calls. Sometimes it runs fine, but sometimes it gives segmentation fault. Thanks. – tiki Jan 27 '13 at 17:08
  • Regarding the nodes, I did run "numactl --hardware" it shows 4 numa nodes (with ids 0,1,2, and 3). Thanks. – tiki Jan 27 '13 at 17:15
  • Before it finished these calls, or at some later point? Obviously, I can't really say what your other code does without seeing it. Typically, seg-fault means you are using incorrect indexes in your array. – Mats Petersson Jan 27 '13 at 17:15
  • I do not have any other code, I am just running the code you provided above as it is. It shows the segfault right after printing all the printf statements with A. And sometimes it runs fine without any segfaults. – tiki Jan 27 '13 at 17:19
  • You may want to add some extra printouts to figure out which of the mbind nodes it fails for. – Mats Petersson Jan 27 '13 at 17:20
  • I did put printf statements after each of the "ASSERT(res==0);", i.e. for first mbind "printf("done1\n");" and so on. When it segfaults it does not print any of those, just shows segfault. But when it runs it shows all the done1, done2, and done3 strings. Thanks. – tiki Jan 27 '13 at 17:24
  • You may want to add a `fflush(stdout)` or use `fprintf(stderr, ...)` - or run it in gdb and see where it crashes... – Mats Petersson Jan 27 '13 at 17:26
  • I did run it using gdb, it just simply shows that "Program terminated with signal SIGSEGV, Segmentation fault. The program no longer exists.". – tiki Jan 27 '13 at 17:29
  • Sounds like you are getting a kernel crash... I've never really used numa interfaces, so I'm not 100% sure how it works. what does `dmesg` show? – Mats Petersson Jan 27 '13 at 17:36
  • That's not what I expected. Is your system logging not set up correctly? – Mats Petersson Jan 27 '13 at 17:45
  • Not quite sure. Googling the problem indicates that it's "fixed by rebooting" and "fixed by updating the kernel", neither of which is a particularly great suggestion. It may also work to clear "dmesg" with "dmesg -C". You probably need to run as root to do that. – Mats Petersson Jan 27 '13 at 18:05
  • I did "dmesg -C" with sudo privilege, then ran the program and then dmesg. It shows lots of info. Do you want me to show the last few lines? Thanks. – tiki Jan 27 '13 at 18:12
  • Probably... But it would probably be best as part of your question rather than a comment. – Mats Petersson Jan 27 '13 at 18:15