ICC reduction is slow and produces wrong results

Question

I am trying to write a simple reduction for the Xeon Phi coprocessor using the Intel compiler (ICC). However, my code has two problems: it produces the wrong result, and it is slower than the serial solution. I compiled the code with these options (mpi_reduce.c is the name of the file):

icc mpi_reduce.c -V -openmp -o mpi_reduce.out

Here is my code:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>


float * _Cilk_shared data; // pointer to "shared" memory

_Cilk_shared float MIC_OMPReduction(int size)
{
    int i;
    #ifdef __MIC__
    float Result;
    int nThreads = 32;
    omp_set_num_threads(nThreads);
    #pragma omp parallel for reduction(+:Result)
    for (i=0; i<size; ++i)
    {
        Result += data[i];
    }
    return Result;
    #else
        printf("Intel(R) Xeon Phi(TM) Coprocessor not available\n");
    #endif
    return 0.0f;
}

float reduction_serial(int size)
{
    float ret = 0;
    int i;
    for ( i=0; i<size; ++i)
    {
        ret += data[i];
    }
    return ret;
}


int main()
{
    struct timeval tv1, tv2;
    int sec, usec;
    int i;
    float result_serial, result_parallel;
    size_t size = 1*1e6;
    int n_bytes = size*sizeof(float);
    data = (_Cilk_shared float *)_Offload_shared_malloc (n_bytes);
    printf("begin computation for size: %zu \n",size); // size is a size_t, so %zu rather than %d
    for (i=0; i<size; ++i)
    {
        data[i] = i%10;
    }

    gettimeofday(&tv1,NULL);
    result_serial = reduction_serial(size);
    gettimeofday(&tv2,NULL);
    sec = (int) (tv2.tv_sec-tv1.tv_sec);
    usec = (int) (tv2.tv_usec-tv1.tv_usec);
    if (usec < 0){
        sec--;
        usec += 1000000;
    }
    printf("reduction_serial: %f sec\n",sec+usec/1000000.0);
    printf("reduction_serial result = %f \n",result_serial);

    gettimeofday(&tv1,NULL);
    result_parallel = _Cilk_offload MIC_OMPReduction(size);
    gettimeofday(&tv2,NULL);
    sec = (int) (tv2.tv_sec-tv1.tv_sec);
    usec = (int) (tv2.tv_usec-tv1.tv_usec);
    if (usec < 0){
        sec--;
        usec += 1000000;
    }
    printf("reduction_parallel: %f sec\n",sec+usec/1000000.0);
    printf("reduction_parallel result = %f \n",result_parallel);

    _Offload_shared_free(data);

    return 0;
}

And here is the output of my code:

begin computation for size: 1000000
reduction_serial: 0.000239 sec
reduction_serial result = 4500000.000000
reduction_parallel: 0.461872 sec
reduction_parallel result = 4513334.000000

I noticed that when I compile the code without the -openmp option, the result of the parallel code is correct; with the -openmp option, the result is wrong.

You did not initialize `Result` to zero. Compile with `-Wall`. — Z boson, Jan 02 '16 at 09:20
I don't know what `_Offload_shared_malloc` does, but if the array is still in main memory then I would imagine there would be significant overhead for the KNC to read it from there. To make the test fair, you need the array in main memory for the serial test and in the memory of the KNC for the parallel test. — Z boson, Jan 02 '16 at 09:28
Offloading data to KNC is **slow**. The only reasonable strategy is to implement chunking with triple buffering: part of the data is processed by the coprocessor while already processed chunks are downloaded to the host and new chunks are uploaded. — Hristo Iliev, Jan 02 '16 at 22:26
@Zboson it worked.... I don't know how... but it worked... thanks buddy — Hamid_UMB, Jan 03 '16 at 18:03
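
As for why initializing `Result` fixes the wrong sum: with `reduction(+:Result)`, each thread gets a private copy that starts at zero, and at the end of the loop those partial sums are combined with the value `Result` had before the parallel region. If that value was never set, the final sum includes whatever garbage was there. The following is only a minimal host-side sketch of the corrected pattern, without the Cilk offload parts; `sum_reduction` and `values` are hypothetical names, not from the question.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: sums `size` floats with an OpenMP reduction.
 * Compiled without OpenMP support, the pragma is ignored and the loop
 * runs serially; the result is computed the same way either way. */
float sum_reduction(const float *values, int size)
{
    float result = 0.0f; /* must be initialized: the partial sums are
                            combined with this original value */
    int i;

    assert(size >= 0);   /* precondition of this sketch */

    #pragma omp parallel for reduction(+:result)
    for (i = 0; i < size; ++i)
    {
        result += values[i];
    }
    return result;
}
```

Calling `sum_reduction(data, size)` on an array filled with `i%10` as in the question should then agree with the serial sum. Note that float addition is not associative, so for general data a parallel sum can still differ from the serial one in the last digits; here every partial sum is a small integer, so that effect does not show up.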
