
I am working on a Collatz Conjecture problem using Xeon Phi through Stampede. My code has been tested and works fine for values up to 100,000, but when testing values up to 1 million, I receive a segmentation fault ("SIGSEGV") almost immediately. I've been banging my head against the wall for days but simply cannot figure out the bug. Any help is truly appreciated.

typedef unsigned long long bigInt;

// Number to test to (starting from 1)
#define bigSize     100000

typedef struct {
    int numSteps;
    bigInt stopPoint;
} batcher;

typedef struct {
    bigInt num;
    batcher to_batch;
} to_ret;

int main () {
    //Stores values as [num][#steps to smaller val][smaller val]
    to_ret retlist[bigSize];
    //Stores values as [#steps to smaller val][smaller val], sorted by num
    batcher results[bigSize];
    ...

    #pragma offload target(mic:0) inout(retlist) shared(retlist)
    {
        #pragma omp parallel for
        for(i = 1; i < bigSize; i++){
            retlist[i].num = i + 1;
            bigInt next = retlist[i].num;
            int count = 0;

            do {
                count++;

                if (next%2 == 1)
                    next=(3*next+1)/2;
                else
                    next/=2;

            } while(next > retlist[i].num);

            retlist[i].to_batch.numSteps = count;
            retlist[i].to_batch.stopPoint = next;
        }
    }

    ///Organizes data into a sorted array
    #pragma omp parallel for
    for (i = 0; i < bigSize; i++){
        results[retlist[i].num - 1] = retlist[i].to_batch;
    }
    ...
}

I'm pretty confident the issue is somewhere in the code segment above.

Sam
  • You're probably running out of stack space. It might help to allocate those two arrays in the data section (i.e., declare them `static` and/or global). This will dramatically increase the size of your executable image, as well as the time it takes the OS to load it into memory before execution. Alternatively, you can increase the size of the stack itself, which is typically done through the linker settings of your project. – barak manos Dec 20 '14 at 21:40
  • Declaring the arrays globally did the trick! Thank you so much! – Sam Dec 20 '14 at 22:08
  • 1
    To increase the process stack size run ulimit -s unlimited on the xeon phi in the shell that you execute the program from. If you use openmp at some point you will need to set the OMP_STACKSIZE environment variable to something larger. – amckinley Dec 20 '14 at 22:48
  • Or put your arrays on the heap using `malloc` (see the sketch below)... – Jeff Hammond Feb 03 '15 at 06:16
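
For completeness, here is a minimal sketch of the heap-based alternative Jeff Hammond suggests, assuming the same bigInt, batcher, to_ret, and bigSize definitions from the question (the calloc calls and the error check are illustrative additions, not part of the original post):

#include <stdio.h>
#include <stdlib.h>

int main () {
    // Allocate both arrays on the heap so their size is no longer
    // limited by the stack limit.
    to_ret  *retlist = calloc(bigSize, sizeof *retlist);
    batcher *results = calloc(bigSize, sizeof *results);
    if (retlist == NULL || results == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    // ... same loops as in the question, indexing retlist[i] and results[i] ...

    free(retlist);
    free(results);
    return 0;
}

Note that if the #pragma offload is kept, a heap pointer would also need a length clause (something like inout(retlist : length(bigSize)) in Intel's offload syntax) so the runtime knows how much data to transfer.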

2 Answers


The following code compiles cleanly and:

  • does not overflow the stack
  • does not obfuscate the code with a bunch of typedefs for structs
  • does not hide the fact that bigInt is an unsigned long long
  • does include the declaration of the index variable `i`

I did not have access to the offload/OpenMP pragmas, so I have commented them out for now.

//typedef unsigned long long bigInt;

// Number to test to (starting from 1)
#define bigSize     (100000)

struct batcher
{
    int numSteps;
    //bigInt stopPoint;
    unsigned long long stopPoint;
};

struct to_ret
{
    //bigInt num;
    unsigned long long num;
    struct batcher to_batch;
};

//Stores values as [num][#steps to smaller val][smaller val]
static struct to_ret retlist[bigSize];
//Stores values as [#steps to smaller val][smaller val], sorted by num
static struct batcher results[bigSize];

int main ()
{
    int i;
    // more code here

    ////#pragma offload target(mic:0) inout(retlist) shared(retlist)
    {
        ////#pragma omp parallel for
        for(i = 1; i < bigSize; i++)
        {
            retlist[i].num = i + 1;
            //bigInt next = retlist[i].num;
            unsigned long long next = retlist[i].num;
            int count = 0;

            do
            {
                count++;

                if (next%2 == 1)
                    next=(3*next+1)/2;
                else
                    next/=2;

            } while(next > retlist[i].num);

            retlist[i].to_batch.numSteps = count;
            retlist[i].to_batch.stopPoint = next;
        }
    }

    ///Organizes data into a sorted array
    ////#pragma omp parallel for
    for (i = 0; i < bigSize; i++){
        results[retlist[i].num - 1] = retlist[i].to_batch;
    }
    // more code here

    return(0);
} // end function: main
user3629249
  • I'm not convinced that all the points you make are improvements. Not overflowing the stack is clearly good, but the others are more debatable. – Jonathan Leffler Dec 21 '14 at 07:17

The full code can be found on GitHub here, and while there are still a lot of efficiency issues with it (it could use vectorization support), what I've currently landed on is this (using the suggestion by barak manos):

typedef unsigned long long bigInt;

/// Number to test up to (starting from 1)
#define bigSize     1000000000 //340282366920938463463374607431768211455

typedef struct {
    int numSteps;
    bigInt stopPoint;
} batcher;

typedef struct {
    bigInt num;
    batcher to_batch;
} to_ret;

__attribute__((target(mic))) to_ret retlist[bigSize]; ///Stores values as [num][#steps to smaller val][smaller val]
__attribute__((target(mic))) batcher results[bigSize]; ///Stores values as [#steps to smaller val][smaller val] & is sorted by num


int main () {
    bigInt j;
    double start, end;

    retlist[0].num = 1; retlist[0].to_batch.numSteps = 0; retlist[0].to_batch.stopPoint = 1;
    start = omp_get_wtime();

    #pragma offload target(mic:0) out(results)
    {
        int count;
        bigInt i, next;

        #pragma omp parallel for private(count, next) // count and next are declared outside the loop, so they must be made thread-private to avoid a data race
        for(i = 1; i < bigSize; i++){
            next = retlist[i].num = i + 1;
            count = 0;

            do {
                count++;

                if (next%2 == 1)
                    next=(3*next+1)/2;
                else
                    next/=2;

            } while(next > retlist[i].num);

            retlist[i].to_batch.numSteps = count;
            retlist[i].to_batch.stopPoint = next;
        }

        ///Organizes data into a sorted array
        #pragma omp parallel for
        for (i = 0; i < bigSize; i++){
            results[i] = retlist[i].to_batch;
        }
    }
...

    for(j = 0; j < bigSize; j++){
        results[j].numSteps += results[results[j].stopPoint-1].numSteps;
    }

    return(0);
}
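
To illustrate the final serial loop with a small example: for n = 3 the offload section records 4 shortcut steps (3 → 5 → 8 → 4 → 2) and a stop point of 2; the loop then adds the already-accumulated total for 2 (1 step, down to 1), giving 5 shortcut steps for 3 overall. Since stopPoint is always smaller than the number itself, lower-indexed entries are already complete totals by the time they are read, which is also why this loop has to stay serial.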

If anyone's interested, please feel free to create a fork of my project.

Sam