
I'm using a version of OpenMP that does not support the `reduction` clause for complex arguments. I need a fast dot-product function like

std::complex<double> dot_prod(std::complex<double> *v1, std::complex<double> *v2, int dim)
{
    std::complex<double> sum = 0.;
    int i;
#pragma omp parallel shared(sum)
#pragma omp for
    for (i = 0; i < dim; i++)
    {
#pragma omp critical
        {
            sum += std::conj(v1[i]) * v2[i];
        }
    }
    return sum;
}

Obviously this code does not speed things up but slows them down. Do you have a fast solution that does not rely on `reduction` for complex arguments?

pawel_winzig
  • note that putting the content of the parallelized for-loop in a critical section is like not doing the parallelization at all, or even worse – CharlesB Jun 06 '11 at 10:25

3 Answers


Each thread can calculate a private sum as the first step, and as the second step the private sums can be combined into the final sum. That way the critical section is needed only in the final step.

std::complex<double> dot_prod(std::complex<double> *v1, std::complex<double> *v2, int dim)
{
    std::complex<double> sum = 0.;
    int i;
#pragma omp parallel shared(sum)
    {
        std::complex<double> priv_sum = 0.;
#pragma omp for
        for (i = 0; i < dim; i++)
        {
            priv_sum += std::conj(v1[i]) * v2[i];
        }

#pragma omp critical
        {
            sum += priv_sum;
        }
    }
    return sum;
}
Michy

Try doing the multiplications in parallel, then sum them serially:

template <typename T>
std::complex<T> dot_prod(std::complex<T> *a, std::complex<T> *b, size_t dim)
{
    std::vector<std::complex<T> > prod(dim);  // or boost::scoped_array + new[]

    #pragma omp parallel for
    for (size_t i=0; i<dim; i++)
        // I believe you had these reversed
        prod[i] = a[i] * std::conj(b[i]);

    std::complex<T> sum(0);
    for (size_t i=0; i<dim; i++)
        sum += prod[i];

    return sum;
}

This does require O(dim) working memory, of course.

Fred Foo
  • thanks for the fast answer! That's a good idea, but unfortunately it seems that for vectors of dim=10^7 (which I tested) it is still 3x faster not to use OpenMP in this way and to stick with the un-parallelized version. I also outsourced the prod generation, but this doesn't change the situation. – pawel_winzig Jun 06 '11 at 10:50
  • @pawel: with so large a dataset, you might want to write a reduce algorithm yourself, as recommended by @Christian Rau. Or grab a better compiler that supports OpenMP `reduce` :) – Fred Foo Jun 06 '11 at 11:20
  • 1
    @pawel_winzig: As dot product of large vectors tends to be limited by memory bandwidth rather than flops, that's not particularly surprising. – janneb Jun 06 '11 at 11:24

Why not have the N threads compute N individual sums? Then at the end you only need to add up the N sums, which can be done serially, since N is quite small. Although I don't know how to accomplish that with OpenMP at the moment (I don't have any experience with it), I'm quite sure this is easily achievable.

Christian Rau