I have n (very large) independent linear systems (Ax = b_i). They all have the same A, but b_i is different for (i = 1, ..., n). I want to solve these n systems in parallel in CUDA.
I was thinking that it might be most efficient to do the LU factorization of A in the host and then copy the new A to the constant memory of GPU (because even if I do the LU in device, only one thread should do it and other threads will be idle. Besides, constant memory is faster). Is there any better way for this?
Another issue is that while all threads are solving their system at the same time with the same algorithm, they are all accessing the same place of memory (A[i]) at the same time, which is not coalesced. How can I optimize this ?