
I have an application where reduce operations (like sum and max) over a large matrix are the bottleneck, and I need to make them as fast as possible. Are there vector instructions in MKL for this?

Is there a special hardware unit that handles reductions on a Xeon CPU, a GPU, or the MIC?

How are reduce operations implemented on this hardware in general?

asked by hrs

3 Answers


You can implement your own simple reductions using the KNC vpermd and vpermf32x4 instructions, as well as the swizzle modifiers, to do cross-lane operations inside the vector units.

The C intrinsic equivalents of these are the _mm512_[mask_]permute* and _mm512_[mask_]swizzle* families.
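
As a rough illustration, here is a minimal sketch of a vectorized sum, assuming the Intel compiler and a 512-bit vector unit. `_mm512_reduce_add_ps` is a compiler-provided helper that expands to a cross-lane permute/add sequence of the kind described above; the alignment and length restrictions are simplifying assumptions, not requirements of the technique:

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: vectorized sum of a float array (Intel compiler, 512-bit vectors).
 * Assumes n is a multiple of 16 and a is 64-byte aligned -- both are
 * simplifying assumptions for illustration. */
static float sum512(const float *a, size_t n)
{
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16)
        acc = _mm512_add_ps(acc, _mm512_load_ps(&a[i])); /* 16 lane-wise partial sums */
    return _mm512_reduce_add_ps(acc); /* cross-lane horizontal sum */
}
```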

However, I recommend that you first look at the array notation reduce operations, which already have high-performance implementations on the MIC.

Look at the reduction operations available here and also check out this video by Taylor Kidd from Intel talking about array notation reductions on the Xeon Phi starting at 20mins 30s.

EDIT: I noticed you are also looking for CPU-based solutions. The array notation reductions will work very well on the Xeon as well.
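
For reference, a hedged sketch of what such an array-notation reduction can look like with the Intel compiler's Cilk Plus extensions; `__sec_reduce_add` and `__sec_reduce_max` are the built-in reducers, and the flat rows*cols layout is an assumption made for illustration:

```c
/* Cilk Plus array notation (icc): built-in reducers over an array section.
 * The matrix is assumed to be stored as a flat array of rows*cols floats. */
float matrix_sum(const float *a, int rows, int cols)
{
    return __sec_reduce_add(a[0:rows*cols]); /* sum of all elements */
}

float matrix_max(const float *a, int rows, int cols)
{
    return __sec_reduce_max(a[0:rows*cols]); /* max of all elements */
}
```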

answered by amckinley

It turns out none of this hardware has a built-in reduce-operation circuit. I had imagined something like sixteen 17-bit adders attached to a 128-bit vector register for a reduce-sum operation. Maybe this is because no one has hit a significant bottleneck with reduce operations. The best solution I found is #pragma omp parallel for reduction in OpenMP, as sketched below, though I have yet to test its performance.
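
A minimal sketch of that OpenMP reduction, treating the matrix as a flat array (the function name and element type are illustrative):

```c
/* Each thread accumulates a private partial sum; OpenMP combines the
 * partial sums when the loop ends. Compile with -fopenmp / -qopenmp. */
double matrix_sum(const double *a, long n)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```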

answered by hrs

This operation is going to be bandwidth-limited and thus vectorization almost certainly doesn't matter. You want the hardware with the most memory bandwidth. An Intel Xeon Phi processor has more aggregate bandwidth (but not bandwidth-per-core) than a Xeon processor.
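
A back-of-the-envelope way to see this: a reduction touches each element exactly once, so the runtime is bounded below by the bytes moved divided by the sustained memory bandwidth. The bandwidth figures in the sketch below are illustrative placeholders, not measured values:

```c
#include <stdio.h>

/* Lower bound on reduction time = data size / sustained bandwidth.
 * All numbers here are example values, chosen only to show the shape
 * of the calculation. */
int main(void)
{
    double gbytes  = 8.0;    /* matrix size in GB (example) */
    double bw_xeon = 60.0;   /* sustained GB/s, illustrative only */
    double bw_phi  = 170.0;  /* sustained GB/s, illustrative only */
    printf("lower bound on Xeon:     %.3f s\n", gbytes / bw_xeon);
    printf("lower bound on Xeon Phi: %.3f s\n", gbytes / bw_phi);
    return 0;
}
```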

answered by Jeff Hammond