NO.
Each element of the output depends on ALL elements of the input vector x.

For example, if x is the input, y is the output, and A is the matrix, then the i-th element of y is generated in the following manner:

y_i = A_i1*x_1 + A_i2*x_2 + ... + A_in*x_n
So if you overwrite x_i with the result from above, any later result y_r whose computation depends on x_i will read the overwritten value instead of the original input and produce an incorrect result.
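To make that concrete, here is a small C++ sketch (the function names and the row-major layout are my own, just for illustration) contrasting the broken in-place version with the usual fix of writing into a separate output vector:

```cpp
#include <cstddef>
#include <vector>

// A is an n x n matrix stored row-major in a flat vector.

// WRONG: by the time row i is processed, x[0] .. x[i-1] have already been
// overwritten, so the dot product for row i uses corrupted input.
void matvec_inplace_wrong(const std::vector<double>& A, std::vector<double>& x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        x[i] = sum;   // destroys input needed by rows r > i
    }
}

// CORRECT: write into a separate output vector so every row sees the original x.
std::vector<double> matvec(const std::vector<double>& A, const std::vector<double>& x, std::size_t n) {
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}
```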
EDIT
I was going to make this a comment, but it was getting too big, so here is the explanation of why the above reasoning holds for parallel implementations too.
This line of reasoning holds unless each parallel group / thread makes a local copy of the original data, in which case the original data can safely be destroyed (see the sketch after the notes below).
However, making such a local copy is only practical and beneficial when:

1. Each parallel thread / block would not be able to access the original array without a significant amount of overhead.
2. There is enough local memory (call it cache, shared memory, or even regular memory in the case of MPI) to hold a separate copy for each parallel thread / block.
Notes:
- (1) may not be true for many multi-threaded applications on a single machine.
- (1) may be true for CUDA, but (2) is definitely not applicable to CUDA.
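For completeness, here is a minimal C++20 sketch of the "local copy per thread" case. The function name, the striding scheme, and the use of std::barrier are my own choices for illustration, not something from the question: each thread snapshots x before any thread is allowed to write, after which overwriting x in place is safe.

```cpp
#include <barrier>
#include <cstddef>
#include <thread>
#include <vector>

void matvec_inplace_parallel(const std::vector<double>& A, std::vector<double>& x,
                             std::size_t n, std::size_t num_threads) {
    std::barrier sync(static_cast<std::ptrdiff_t>(num_threads));
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            std::vector<double> x_local(x);   // per-thread snapshot of the original input
            sync.arrive_and_wait();           // wait until every thread has its copy
            for (std::size_t i = t; i < n; i += num_threads) {
                double sum = 0.0;
                for (std::size_t j = 0; j < n; ++j)
                    sum += A[i * n + j] * x_local[j];
                x[i] = sum;                   // safe: all reads come from x_local
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Note that the extra copies (and the barrier) are exactly the overhead the two conditions above are about; if the threads can simply read the original x and write into a separate y, that is usually the cheaper option.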