Efficient implementation of indirect daxpy operation

Question

_axpy is a BLAS Level 1 operation that implements the following:

for i = 1:n
    a[i] = a[i]-$\alpha$ b[i]

Efficient implementations of the regular daxpy are available through various BLAS libraries such as MKL.

In my case, I want to implement the following variant of daxpy, which uses indirect addressing:

for i = 1:n
    a[ind1[i]] = a[ind1[i]]-$\alpha$ b[i]

where ind1 contains the indices of the elements of vector a that need to be updated. What I know is that ind1 is a strictly increasing array, i.e. $ind1[i] > ind1[j] \ \forall i > j$.
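For reference, the loop above could be written in C roughly as follows. This is only a minimal scalar sketch: the function name `daxpy_indirect_ref`, the 0-based indexing, and the `size_t` index type are assumptions for illustration, not part of any BLAS interface.

```c
#include <stddef.h>

/* Scalar baseline (hypothetical name): a[ind1[i]] -= alpha * b[i].
 * Assumes 0-based indices, ind1 strictly increasing, and every
 * ind1[i] within the bounds of a. */
void daxpy_indirect_ref(size_t n, double alpha, const double *b,
                        double *a, const size_t *ind1)
{
    for (size_t i = 0; i < n; ++i)
        a[ind1[i]] -= alpha * b[i];
}
```

Because ind1 is strictly increasing, no element of a is written twice, so the iterations are independent of each other.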

I assume such a computation arises very often in sparse linear algebra. Does anyone know of an efficient SSE/AVX-based implementation of such routines?

If the `ind1` array contains contiguous runs, you may be able to do something to speed up the operation. If `ind1` is essentially arbitrary, then there’s almost nothing you can do to optimize this (except possibly for prefetching). SSE/AVX simply have no efficient means to do the necessary gather/scatter operations. – Stephen Canon, Dec 19 '13 at 17:38
It does contain contiguous runs, but probably of small length (10-30). My guess is that prefetching might work, but I've been warned that manual prefetching might do more harm than good. – arbitUser1401, Dec 19 '13 at 17:42

score 0 · Answer 1 · answered Jul 09 '15 at 20:30

You could do a movss, then 3 insertps, until you fill a vector, then do the math, then scatter the results back to their locations. If the indices are 16 or 32 bits wide, you can load multiple indices at once into a 64-bit GP register, and shift + movzx to get array indices.
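A rough plain-C sketch of that gather / compute / scatter pattern is below. It is not the code from this answer: the function name `saxpy_indirect_x4` is hypothetical, and it assumes single-precision data (to match movss/insertps), 16-bit indices, a little-endian machine, and strictly increasing indices. The shift-and-mask step stands in for shift + movzx, and the small temporary array stands in for the vector register.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: a[ind[i]] -= alpha * b[i], four elements per step.
 * Assumes little-endian byte order and strictly increasing 16-bit indices. */
void saxpy_indirect_x4(size_t n, float alpha, const float *b,
                       float *a, const uint16_t *ind)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint64_t packed;
        memcpy(&packed, ind + i, sizeof packed); /* one 64-bit load holds 4 indices */

        size_t idx[4];
        float tmp[4];
        for (int k = 0; k < 4; ++k) {
            idx[k] = (size_t)(packed & 0xFFFF);  /* zero-extend the low 16 bits */
            packed >>= 16;                       /* move on to the next index */
            tmp[k] = a[idx[k]];                  /* gather one lane */
        }
        for (int k = 0; k < 4; ++k)
            tmp[k] -= alpha * b[i + k];          /* the math, on 4 lanes */
        for (int k = 0; k < 4; ++k)
            a[idx[k]] = tmp[k];                  /* scatter back */
    }
    for (; i < n; ++i)                           /* scalar tail */
        a[ind[i]] -= alpha * b[i];
}
```

Whether this beats the simple scalar loop depends on the hardware and compiler, so it would need to be measured.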

See for example https://github.com/pcordes/par2-asm-experiments/blob/master/asm-pinsrw.s. That function looks up 16-bit GF16 longmultiply components, based on the high and low halves of 16-bit words. My indices there are 8-bit, so I can get a lot of them in a single 64-bit load.

If there's enough contiguous data in your indices to be worth the branch mispredicts you'd incur finding it, then as @StephenCanon says, it could be worth just looking for runs and doing each contiguous chunk with SIMD.
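The run-based approach might look something like the sketch below. The function name `daxpy_indirect_runs` is hypothetical, and the code assumes 0-based, strictly increasing `size_t` indices. Each run of consecutive indices becomes an ordinary contiguous axpy, which the compiler can vectorize or which could instead be handed to a library daxpy.

```c
#include <stddef.h>

/* Hypothetical helper: split ind1 into runs of consecutive indices and
 * apply a contiguous update to each run.
 * Assumes ind1 is strictly increasing and in bounds of a. */
void daxpy_indirect_runs(size_t n, double alpha, const double *b,
                         double *a, const size_t *ind1)
{
    size_t i = 0;
    while (i < n) {
        /* Find the length of the run starting at position i. */
        size_t len = 1;
        while (i + len < n && ind1[i + len] == ind1[i] + len)
            ++len;

        /* Contiguous update over the run: dst[k] -= alpha * src[k]. */
        double *dst = a + ind1[i];
        const double *src = b + i;
        for (size_t k = 0; k < len; ++k)
            dst[k] -= alpha * src[k];

        i += len;
    }
}
```

With runs of only 10-30 elements, the cost of the run search and its branches is not negligible, so this is worth benchmarking against the plain scalar loop.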
