I'm building a kernel that, among other things, uses the MAGMA function magma_dgeqrf2_gpu to perform a QR factorization. On return, the upper triangular factor R is stored in the upper triangle of the general matrix d_A on the GPU device.
Without transferring d_A back to the host (I need the data on the GPU for further operations), is there a library routine that reduces d_A to, or extracts from it, the upper triangular matrix R directly on the device?
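
To make the goal concrete, here is a minimal sketch of the hand-written fallback I have in mind if no library call exists. It assumes d_A is stored column-major with leading dimension ldda, as LAPACK-style routines expect. The names zero_strict_lower and extract_upper_inplace are my own hypothetical helpers, not MAGMA functions. The kernel zeroes everything strictly below the diagonal in place, so only R remains.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Zero the strictly lower triangle of an m-by-n column-major matrix in place.
// Entries on and above the diagonal (the R factor) are left untouched.
// Assumption: element (row, col) lives at dA[row + col * ldda].
__global__ void zero_strict_lower(int m, int n, double *dA, int ldda)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col < n && row > col) {
        dA[row + (size_t)col * ldda] = 0.0;
    }
}

// Hypothetical helper (not part of MAGMA): launch the kernel on a stream.
// The launch is asynchronous; synchronize before relying on the result
// from a different stream or from the host.
void extract_upper_inplace(int m, int n, double *dA, int ldda,
                           cudaStream_t stream)
{
    if (m <= 0 || n <= 0) return;
    dim3 block(32, 8);  // threadIdx.x walks rows, so column accesses coalesce
    dim3 grid((m + block.x - 1) / block.x,
              (n + block.y - 1) / block.y);
    zero_strict_lower<<<grid, block, 0, stream>>>(m, n, dA, ldda);
}
```

Called as extract_upper_inplace(m, n, d_A, ldda, 0) right after the factorization, this keeps everything on the device. One caveat: as far as I understand, the part of d_A below the diagonal holds the Householder vectors that define Q, so zeroing in place discards them. If Q is still needed, R would have to be copied into a separate device buffer first.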