
If A is an n x n matrix and x a vector of dimension n, is it then possible to pass x to GEMV as the argument to both the x and y parameters, with beta=0, to compute x ← A*x in place?

I'm specifically interested in the Cublas implementation, with C interface.
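For reference, a minimal sketch of the aliased call the question is asking about, assuming the cuBLAS v2 C API; the handle and the device pointers d_A and d_x are illustrative names, not part of the original question:

    #include <cublas_v2.h>

    /* Sketch only: the aliased call the question describes -- the same device
       pointer passed as both the input vector x and the output vector y.
       d_A is an n x n column-major matrix on the device, d_x a length-n vector. */
    void gemv_in_place_attempt(cublasHandle_t handle, const float *d_A, float *d_x, int n)
    {
        const float alpha = 1.0f;
        const float beta  = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, n, n,
                    &alpha, d_A, n,
                    d_x, 1,        /* x: input vector              */
                    &beta,
                    d_x, 1);       /* y: same pointer -- aliased   */
    }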

leftaroundabout

2 Answers


No. And for Fortran it has nothing to do with the implementation: aliasing actual arguments to a subprogram breaks the language standard unless those arguments are Intent(In). Thus if the interface has dummy arguments that are Intent(Out), Intent(InOut), or have no Intent, you should always use separate variables for the corresponding actual arguments when invoking the subprogram.

Ian Bush
    That's interesting. I'm calling the functions through the Cublas C interface so it doesn't _quite_ apply to my situation, but I suppose if it is language-forbidden in Fortran I can't trust it to work with the Fortran-modelled C interface either. – leftaroundabout Mar 29 '12 at 13:07
  • @leftroundabout I don't think so. If the actual routine is a C code and is compiled by a C compiler, Fortran aliasing rules do not apply. They are made to ease reasoning about pointers inside the subroutine, not to bind the caller. – Vladimir F Героям слава Mar 30 '12 at 20:56

NO.

Each element of the output depends on ALL elements of the input vector x.

For example, if x is the input, y is the output, and A is the matrix, the ith element of y is generated as follows:

y_i = A_i1*x_1 + A_i2*x_2 + ... + A_in*x_n

So if you overwrite x_i with the result above, the computation of some other element y_r, which still depends on the original x_i, will read the overwritten value and produce an incorrect result.
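A small host-side illustration of that failure mode (plain C, not cuBLAS; the in-place loop mimics what aliasing x and y would mean for a serial implementation):

    #include <stdio.h>

    int main(void)
    {
        /* 2 x 2 example: A = [1 2; 3 4], x = [1, 1].
           The correct result of A*x is [3, 7]. */
        float A[2][2] = { {1.0f, 2.0f}, {3.0f, 4.0f} };
        float x[2]    = { 1.0f, 1.0f };

        /* "In-place" GEMV: write each y_i straight back into x_i. */
        for (int i = 0; i < 2; ++i) {
            float yi = 0.0f;
            for (int j = 0; j < 2; ++j)
                yi += A[i][j] * x[j];   /* for i = 1 this reads the already
                                           overwritten x[0] = 3, not 1      */
            x[i] = yi;
        }

        /* Prints 3 and 13 instead of the correct 3 and 7. */
        printf("%g %g\n", x[0], x[1]);
        return 0;
    }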

EDIT

I was going to make this a comment, but it was getting too big. So here is why the above reasoning holds for parallel implementations too.

This line of reasoning holds unless each parallel group / thread makes a local copy of the original data, in which case the original data can safely be overwritten.

However, doing so (making a local copy) is only practical and beneficial when

  1. Each parallel thread / block would not be able to access the original array without a significant amount of overhead.
  2. There is enough local memory (call it cache, or shared memory or even regular memory in case of MPI) to hold a separate copy for each parallel thread / block.

Notes:

  • (1) may not be true for many multi-threaded applications on a single machine.
  • (1) may be true for CUDA but (2) is definitely not applicable for CUDA.
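A hedged sketch of the usual workaround, assuming the cuBLAS v2 C API: compute into a separate device buffer and, if in-place semantics are really wanted, copy the result back over x afterwards. The function name and the pointers d_A, d_x, d_y are illustrative, and error checks are omitted for brevity:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* Sketch: y = A*x into a scratch buffer, then copy the result back over x.
       d_A: n x n column-major matrix on the device, d_x: length-n device vector. */
    void gemv_then_copy_back(cublasHandle_t handle, const float *d_A, float *d_x, int n)
    {
        const float alpha = 1.0f;
        const float beta  = 0.0f;
        float *d_y = NULL;

        cudaMalloc((void **)&d_y, (size_t)n * sizeof(float));

        cublasSgemv(handle, CUBLAS_OP_N, n, n,
                    &alpha, d_A, n,
                    d_x, 1,
                    &beta,
                    d_y, 1);                     /* distinct output buffer  */

        cublasScopy(handle, n, d_y, 1, d_x, 1);  /* overwrite x afterwards  */

        cudaFree(d_y);
    }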
Pavan Yalamanchili
  • Quite intuitive. Though I doubt it's in general a good idea to reason about the behaviour of highly parallelized algorithms by looking at where the simplistic procedural implementation takes values from... – leftaroundabout Mar 29 '12 at 20:38