You're going to multiply 10 million big vectors by a huge matrix that is the same for all of them.
It would be fastest if all possible decision-making could be compiled-out ahead of time.
In other words, there is a lot of index calculation and loop testing that would otherwise be repeated identically millions of times.
This sounds like a perfect case for pre-compilation:
Write a small program that takes your 200x200 matrix data values as input and prints out a piece of program text defining a function that takes the input vector and writes the result vector.
It could look something like this:
void multTheMatrixByTheVector(double a[200], double b[200]){
  b[0] = 0
    + a[0] * <a constant, the value of mat[0][0]>
    + a[1] * <a constant, the value of mat[1][0]>
    ...
    + a[199] * <a constant, the value of mat[199][0]>
    ;
  b[1] = 0
    + a[0] * <a constant, the value of mat[0][1]>
    + a[1] * <a constant, the value of mat[1][1]>
    ...
    + a[199] * <a constant, the value of mat[199][1]>
    ;
  ...
  b[199] = etc. etc.
}
That function will be around 40,000 lines long (one line for each of the 200 x 200 matrix elements), but a decent compiler should be able to handle it.
Of course, if any of the matrix elements are zero, i.e. there's some sparsity, you can omit those lines (or let the compiler optimizer do it).
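For what it's worth, here is a minimal sketch of what that small generator program could look like, in C. The name emit_mult_function and its parameters are made up for illustration. It assumes the matrix is stored as a flat row-major array (so mat[i*n + j] is what I called mat[i][j] above) and that every value is finite, since inf or nan would not print as valid C constants. It also skips zero entries, as suggested above.

```c
#include <stdio.h>

/* Hypothetical helper (not from any library): writes the C source of an
   unrolled vector-times-matrix function to `out`.
   `mat` is an n x n matrix in row-major order: mat[i*n + j] == mat[i][j].
   Assumes all matrix values are finite. */
void emit_mult_function(FILE *out, const double *mat, int n, const char *name)
{
    fprintf(out, "void %s(const double a[%d], double b[%d]){\n", name, n, n);
    for (int j = 0; j < n; j++) {
        /* One output element per column j of the matrix. */
        fprintf(out, "  b[%d] = 0\n", j);
        for (int i = 0; i < n; i++) {
            double v = mat[i * n + j];
            if (v == 0.0)
                continue;  /* sparsity: emit nothing for zero entries */
            /* %.17g prints enough digits to reproduce the double exactly */
            fprintf(out, "    + a[%d] * %.17g\n", i, v);
        }
        fprintf(out, "  ;\n");
    }
    fprintf(out, "}\n");
}
```

You would call this from a main that loads your 200x200 values and passes stdout (or an opened .c file) as `out`, then compile the generated text together with your main program.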
To do this in CUDA or with vector instructions, you'd have to modify it accordingly, but that should be doable.
When you include that function in your main program, it should be able to run about as fast as the machine can go.
It's not wasting any cycles doing index calculations, loop testing, or multiplying by empty matrix cells.
Then if it takes 10 ns per multiply-and-add, my back-of-the-envelope estimate is 40,000 x 10 ns = 400 usec per vector, or 4000 seconds for all 10 million vectors, which is a little over an hour.
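If you want to redo that estimate with your own numbers, the arithmetic is simple enough to put in a tiny C function. The name estimate_seconds is made up for illustration, and the model is the same crude assumption as above: a fixed cost per multiply-add and nothing else.

```c
/* Hypothetical helper: estimated total run time in seconds, assuming a
   fixed cost per multiply-add and no other overhead.
   n           = matrix dimension (n x n multiply-adds per vector)
   vectors     = number of vectors to multiply
   ns_per_madd = assumed nanoseconds per multiply-add */
double estimate_seconds(double n, double vectors, double ns_per_madd)
{
    double madds_per_vector = n * n;
    double seconds_per_vector = madds_per_vector * ns_per_madd * 1e-9;
    return seconds_per_vector * vectors;
}
```

With n = 200, 10 million vectors, and 10 ns per multiply-add, this gives the same 4000 seconds.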