
UPDATE: the (sparse) three-dimensional matrix v in my question below is symmetric: v(i1,i2,i3) = v(j1,j2,j3) where (j1,j2,j3) is any of the 6 permutations of (i1,i2,i3), i.e.

v(i1,i2,i3) = v(i1,i3,i2) = v(i2,i3,i1) = v(i2,i1,i3) = v(i3,i1,i2) = v(i3,i2,i1).

Moreover, v(i1,i2,i3) != 0 only when i1 != i2 && i1 != i3 && i2 != i3.

E.g. v(i,i,j) = 0, v(i, k, k) = 0, v(k, j, k) = 0, etc...

I thought that with this additional information, I could already get a significant speed-up by doing the following:

  • Remark: v contains duplicate values (a triplet (i,j,k) has 6 permutations, and the values of v for these 6 are the same).
  • So I defined a more compact matrix u that contains only the non-duplicate entries of v. The indices of u are the triplets (i1,i2,i3) with i1 < i2 < i3. The length of u is the length of v divided by 6.
  • I computed the sum by iterating over the new value vector and the new index vectors.

With this, I only got a small speed-up. I realized that instead of iterating N times doing one multiplication each time, I now iterate N/6 times doing 6 multiplications each time, which is pretty much the same amount of work as before :(
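In code, that accumulation looks something like the minimal sketch below (hypothetical names: u_val, u_i, u_j, u_k are the deduplicated value and index vectors, with u_i[t] < u_j[t] < u_k[t]; plain std::vector storage assumed). It makes the 6-products-per-stored-triplet problem visible:

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch: accumulate y1 from the deduplicated tensor u, whose triplets
// satisfy i < j < k. Each stored triplet still contributes 6 products,
// so the arithmetic is unchanged; only the storage of v shrinks by 6x.
void compute_y1_sym(const std::vector<double>& u_val,
                    const std::vector<int>& u_i,
                    const std::vector<int>& u_j,
                    const std::vector<int>& u_k,
                    const std::vector<double>& x2,
                    const std::vector<double>& x3,
                    std::vector<double>& y1)   // size n
{
    std::fill(y1.begin(), y1.end(), 0.0);
    for (std::size_t t = 0; t < u_val.size(); ++t) {
        const int i = u_i[t], j = u_j[t], k = u_k[t];
        const double w = u_val[t];
        y1[i] += w * (x2[j] * x3[k] + x2[k] * x3[j]);  // permutations with i1 == i
        y1[j] += w * (x2[i] * x3[k] + x2[k] * x3[i]);  // permutations with i1 == j
        y1[k] += w * (x2[i] * x3[j] + x2[j] * x3[i]);  // permutations with i1 == k
    }
}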

Hope somebody could come up with a better solution.

--- (Original question) ---

In my program I have an expensive operation that is repeated every iteration.

I have three n-dimensional vectors x1, x2 and x3 that are supposed to change every iteration.

I have four N-dimensional vectors I1, I2, I3 and v that are pre-defined and will not change, where:

  • I1, I2 and I3 contain the indices of respectively x1, x2 and x3 (the elements in I_i are between 0 and n-1)
  • v is a vector of values.

For example:

(figure: example contents of I1, I2, I3 and v)

We can see v as a (reshaped) sparse three-dimensional matrix: each index k of v corresponds to a triplet (i1,i2,i3) = (I1[k], I2[k], I3[k]) of indices into x1, x2 and x3.

I want to compute at each iteration three n-dimensional vectors y1, y2 and y3 defined by:

y1(i1) = sum_{i2,i3} v(i1,i2,i3) * x2(i2) * x3(i3)

y2(i2) = sum_{i1,i3} v(i1,i2,i3) * x1(i1) * x3(i3)

y3(i3) = sum_{i1,i2} v(i1,i2,i3) * x1(i1) * x2(i2)
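
For reference, a minimal sketch of the straightforward loop for y1, assuming plain std::vector storage (y2 and y3 are analogous):

#include <algorithm>
#include <cstddef>
#include <vector>

// Naive O(N) accumulation of y1 from the triplet (COO-style) storage.
void compute_y1_coo(const std::vector<int>& I1, const std::vector<int>& I2,
                    const std::vector<int>& I3, const std::vector<double>& v,
                    const std::vector<double>& x2, const std::vector<double>& x3,
                    std::vector<double>& y1)   // size n
{
    std::fill(y1.begin(), y1.end(), 0.0);
    for (std::size_t k = 0; k < v.size(); ++k)
        y1[I1[k]] += v[k] * x2[I2[k]] * x3[I3[k]];
}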

More precisely what the program does is:

Repeat:

  • Compute y1 then update x1 = f(y1)
  • Compute y2 then update x2 = f(y2)
  • Compute y3 then update x3 = f(y3)

where f is some external function.

I would like to know if there is a C++ library that helps me do this as fast as possible. Plain for loops are just too slow.

Thank you very much for your help!

Update: It looks like it's not easy to get a better solution than the straightforward for loops. If the vector of indices I1 above is sorted in non-decreasing order, can we compute y1 faster? For example: I1 = [0 0 0 0 1 1 2 2 2 3 3 3 ... n-1 n-1].
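
One possibility is sketched below, under the assumption that a row_start offset array of size n+1 has been precomputed (a hypothetical CSR-style layout): each y1(i1) then becomes a single contiguous reduction over a block of v, and the scattered write to y1 disappears.

#include <cstddef>
#include <vector>

// Sketch: I1 is sorted, so store v CSR-style. row_start has size n+1 and
// row_start[i] .. row_start[i+1] is the range of triplets with first index i.
void compute_y1_csr(const std::vector<std::size_t>& row_start,
                    const std::vector<int>& I2, const std::vector<int>& I3,
                    const std::vector<double>& v,
                    const std::vector<double>& x2, const std::vector<double>& x3,
                    std::vector<double>& y1)   // size n
{
    for (std::size_t i = 0; i + 1 < row_start.size(); ++i) {
        double acc = 0.0;   // y1[i] is now one contiguous reduction over v
        for (std::size_t k = row_start[i]; k < row_start[i + 1]; ++k)
            acc += v[k] * x2[I2[k]] * x3[I3[k]];
        y1[i] = acc;
    }
}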

– f10w
  • A better place to ask for [software library recommendations](http://softwarerecs.stackexchange.com/) is SoftwareRecs.StackExchange.com. – Thomas Matthews Oct 31 '16 at 16:35
  • What order of magnitude are `n` and `N`? And if you just calculated, e.g., `temp[i1,i2]=sum_{i3} v(i1,i2,i3)*x3(i3)` how sparse would that intermediate result be? (this would be a sub-expression you could use for computing both `y2` and `y3`). – chtz Nov 04 '16 at 14:53
  • @chtz : Thanks. Approximately, N = 3e7 and n = 3e3. I haven't checked for the sparseness of `temp[i1,i2]`, but what will happen if it's very sparse or not quite sparse? I also tried to do this before asking the above question: compute and store `temp[i1,i2]` in a dense Eigen matrix at each iteration, then use it for computing both `y1` and `y2` using Eigen matrix multiplication. However, it's slower than my original solution (i.e. computing using straight-forward for loops for `yM`). – f10w Nov 04 '16 at 15:12
  • @chtz : In addition, if the vector of indices `I1` above is ordered in non-decreasing order, can we compute `y1` faster? Please see the update at the end of the question. Thanks again for your help! – f10w Nov 04 '16 at 15:19
  • Assuming the indexes are distributed somewhat evenly, you have about 1e4 entries for each index of I1, I guess? And for each pair (i1, i2) you will have about 3 entries in V on average (so the result would be quite dense). If you already tried storing that into a dense matrix without success, it will not be worth it. – chtz Nov 04 '16 at 16:20
  • If entries in `I1` are sorted, you could try to store `v` in some sort of compressed-column format, i.e. don't store each `i1`, but just the start and end index of each block. When calculating `y1` accumulate `v(i1, i2, i3)*x2(i2)*x3(i3)` by first gathering all required elements into a dense (number of entries where `i1` is same)-sized array each and do a dense (and fast) elementwise product with summation afterwards. Is there some kind of symmetry how the indexes of `v` are distributed? – chtz Nov 04 '16 at 16:29
  • @chtz The indices of `I1` are not distributed evenly, unfortunately (I think I can make the numbers of each index equal by adding the missing indices with value v = 0). I have also been thinking of doing what you have suggested, i.e.: expanding `xM` according to `iM`, doing element-wise multiplication, and then doing a reduction/summation of the values of the same indices. – f10w Nov 04 '16 at 16:54
  • @chtz Oh yes thanks for asking about the symmetry. I have checked the data and it's actually symmetric: `v(i1,i2,i3) = v(j1,j2,j3)` where `(j1,j2,j3)` is any of the 6 permutations of `(i1,i2,i3)`, i.e. `v(i1,i2,i3) = v(i1,i3,i2) = v(i2,i3,i1) = v(i2,i1,i3) = v(i3,i1,i2) = v(i3,i2,i1)`. I think I could already get a significant speed-up even using for loops. – f10w Nov 04 '16 at 16:57
  • @chtz I was wrong, it's not that easy. Please see the update at the top of the question. Thanks for your help! – f10w Nov 04 '16 at 22:57
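
For illustration, a sketch of the two-step contraction discussed in the comments above (Eigen assumed; temp depends only on x3, so one build of it can serve both y1 and y2 even though x1 is updated in between; note that f10w reports this approach was slower in practice):

#include <Eigen/Dense>
#include <cstddef>
#include <vector>

// temp(i1,i2) = sum_{i3} v(i1,i2,i3) * x3(i3) depends only on x3, so one
// build of temp serves both y1 = temp * x2 and y2 = temp^T * x1.
void two_step(const std::vector<int>& I1, const std::vector<int>& I2,
              const std::vector<int>& I3, const std::vector<double>& v,
              const Eigen::VectorXd& x1, const Eigen::VectorXd& x2,
              const Eigen::VectorXd& x3,
              Eigen::VectorXd& y1, Eigen::VectorXd& y2)
{
    const Eigen::Index n = x1.size();
    Eigen::MatrixXd temp = Eigen::MatrixXd::Zero(n, n);
    for (std::size_t k = 0; k < v.size(); ++k)
        temp(I1[k], I2[k]) += v[k] * x3(I3[k]);
    y1 = temp * x2;              // y1(i1) = sum_{i2} temp(i1,i2) * x2(i2)
    y2 = temp.transpose() * x1;  // y2(i2) = sum_{i1} temp(i1,i2) * x1(i1)
}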

1 Answer


The simple answer is no, at least not trivially. Your access pattern (e.g. x2(i2)*x3(i3)) does not access contiguous memory (at least not in a way known at compile time), but goes through a layer of indirection instead. Because of this, SIMD instructions are pretty useless, as they work on contiguous chunks of memory. What you may want to consider is creating a copy of xM sorted according to iM, removing the layer of indirection. This should reduce the number of cache misses that xM(iM) generates, and since it is accessed N times, that may shave off some of the wall time (assuming N is large).
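
A sketch of that gather step, with hypothetical N-sized scratch buffers x2g and x3g (the indirection is still paid once during the gather; the hope is that the accumulation pass then streams through contiguous memory):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the gather idea: copy x2 and x3 into N-sized buffers laid out in
// triplet order, so the accumulation pass reads v, x2g, x3g in lockstep.
void compute_y1_gathered(const std::vector<int>& I1, const std::vector<int>& I2,
                         const std::vector<int>& I3, const std::vector<double>& v,
                         const std::vector<double>& x2, const std::vector<double>& x3,
                         std::vector<double>& x2g, std::vector<double>& x3g,
                         std::vector<double>& y1)
{
    const std::size_t N = v.size();
    for (std::size_t k = 0; k < N; ++k) {   // the indirection is paid here, once
        x2g[k] = x2[I2[k]];
        x3g[k] = x3[I3[k]];
    }
    std::fill(y1.begin(), y1.end(), 0.0);
    for (std::size_t k = 0; k < N; ++k)     // contiguous reads of v, x2g, x3g
        y1[I1[k]] += v[k] * x2g[k] * x3g[k];
}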

If maximal accuracy is not critical, you may want to consider using an FFT-based method instead of the direct convolution (at least, that's how I understood your question; feel free to correct me if I'm wrong).

Assuming you are doing a convolution and the vectors (a and b, the same sizes as in your question) are large, the result c can be calculated naïvely as:

// O(n^2) direct (linear) convolution; c has size a.size() + b.size() - 1
for (int i = 0; i < c.size(); i++) {
    c(i) = 0;
    for (int j = std::max(0, i - (int)b.size() + 1); j <= std::min(i, (int)a.size() - 1); j++)
        c(i) += a(j) * b(i - j);   // c(i) = sum_j a(j) * b(i - j)
}

Using the convolution theorem, you can instead take the Fourier transform of both a and b, perform an element-wise multiplication, and then take the inverse Fourier transform of the result to get c (the details will probably differ a little for your case):

// O(n log(n)); note A, B, and C are vectors of complex floating point numbers
#include <unsupported/Eigen/FFT>  // Eigen's unsupported FFT module

Eigen::FFT<float> fft;
fft.fwd(A, a);              // A = FFT(a)   (destination argument comes first)
fft.fwd(B, b);              // B = FFT(b)
C = A.array() * B.array();  // element-wise product in the frequency domain
fft.inv(c, C);              // c = inverse FFT of C
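
One caveat with this route: a plain element-wise product in the frequency domain gives a circular convolution, so a and b should be zero-padded to length at least a.size() + b.size() - 1 before the forward transforms if you need the linear convolution.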
– Avi Ginsburg
  • Thanks. I guess that to minimize the cost, I have to iterate over `N` as few times as possible (if no vectorization is possible). Currently I only do that once for each `yM` (I cannot compute `y1`, `y2` and `y3` at once because I have to update `xM` in-between -- please see the updated question). So I cannot see how creating a copy of `xM`, as you have suggested, could improve the speed, because by creating them I already iterate over `N`? (Sorry I'm new to these kinds of things). And could you please give more details on how FFT can be applied to my case? It sounds very interesting. – f10w Oct 31 '16 at 18:23
  • Read e.g. [this](http://stackoverflow.com/questions/18384054/what-are-the-downsides-of-convolution-by-fft-compared-to-realspace-convolution). If I have time later, I may write up an example in my answer. – Avi Ginsburg Nov 01 '16 at 06:06
  • I don't think that FFT will help for this. The problem looks like a straight-forward (sparse) (Tensor*Vector)*Vector product to me. If you have AVX2 available this may even be partially vectorizable using gather instructions (which is not supported by Eigen, though). – chtz Nov 04 '16 at 14:51