
I am currently implementing a matrix transpose in C. My algorithm performs a lot of swap operations: I need to exchange the two double-precision numbers that two pointers (double*) point to.

    void transposenRightHalf(double *m, int size){
        double temp;
        for (int i = 0; i < size-1; i++) {
            for (int j = i+1; j < size; j++) {
                temp = *(m+i*size+j);
                *(m+i*size+j) = *(m+j*size+i);
                *(m+j*size+i) = temp;
            }
        }
    }

Since I am doing this on a Cray machine, which uses the x86 architecture, I am trying to use inline assembly to do the swap operation. I did some searching but could not find an example. I really need some help.

user94602
  • What makes you believe there's a problem? My x86 machine has XMM registers, and my compiler does something very obvious that I could not imagine doing better myself. – Kerrek SB Feb 13 '14 at 01:38
  • How come you are passing a pointer to a double instead of a pointer to an array of doubles? Why don't you use the Cray library function **TRANSPOSE (matrix)**? It is probably already highly optimized. – Marichyasana Feb 13 '14 at 03:59
  • @Marichyasana Thanks for the comment. It is a class project; we are not allowed to use these functions. – user94602 Feb 13 '14 at 11:59
  • @Tavian Barnes Thanks! It is a typo. – user94602 Feb 13 '14 at 12:00

1 Answer


AVX2 gather instructions might offer some opportunity for parallelization.

The operation is bottlenecked by memory bandwidth, so you need to think about how to arrange your memory accesses to make the best use of cache. Doing the transpose in blocks instead of one row at a time will greatly increase the locality of your memory accesses. Watch out for cache associativity limitations that may make your cache act unexpectedly small if the stride of your accesses is wrong (though at the worst, this would degrade back to your current performance).
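As a rough illustration of the blocked approach described above, here is a minimal sketch of an in-place transpose for a square, row-major matrix. The function name `transpose_blocked` and the tile size `BLOCK` are my own assumptions, not anything from the question; the best tile size depends on the cache sizes of the machine and should be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tile edge length; tune it for the target cache. */
#define BLOCK 32

/* In-place transpose of a size x size row-major matrix, done tile by tile.
 * Only tiles on or above the diagonal are visited; each element above the
 * diagonal is swapped with its mirror image exactly once. */
void transpose_blocked(double *m, int size)
{
    assert(size >= 0);

    for (int ib = 0; ib < size; ib += BLOCK) {
        int imax = (ib + BLOCK < size) ? ib + BLOCK : size;

        for (int jb = ib; jb < size; jb += BLOCK) {
            int jmax = (jb + BLOCK < size) ? jb + BLOCK : size;

            for (int i = ib; i < imax; i++) {
                /* In a diagonal tile, start past the diagonal so that no
                 * pair is swapped twice; other tiles are fully covered. */
                int jstart = (jb == ib) ? i + 1 : jb;

                for (int j = jstart; j < jmax; j++) {
                    size_t a = (size_t)i * (size_t)size + (size_t)j;
                    size_t b = (size_t)j * (size_t)size + (size_t)i;
                    double temp = m[a];
                    m[a] = m[b];
                    m[b] = temp;
                }
            }
        }
    }
}
```

The swap itself is left as plain C: an optimizing compiler will typically turn it into a pair of loads and stores through XMM registers, so inline assembly is unlikely to help there. Whatever gain there is should come from the tiled access pattern.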

user57368