CUBLAS does not wrap around BLAS.
CUBLAS also accesses matrices in a column-major ordering, such as some Fortran codes and BLAS.
I am more used to writing code in C, even for CUDA.
A code written with CBLAS (which is a C wrap of BLAS) can easily be change into a CUDA code.
Be aware that Fortran codes that use BLAS are quite different from C/C++ codes that use CBLAS.
Fortran and BLAS normally store matrices or double arrays in column-major ordering,
but C/C++ normally handle Row-major ordering.
I normally handle this problem writing saving the matrices in a 1D arrays,
and use #define to write a macro toa access the element i,j of a matrix as:
/* define macro to access Aij in the row-wise array A[M*N] */
#define indrow(ii,jj,N) (ii-1)*N+jj-1 /* does not depend on rows M */
/* define macro to access Aij in the col-wise array A[M*N] */
#define indcol(ii,jj,M) (jj-1)*M+ii-1
CBLAS library has a well organize parameters and conventions (const enum variables)
to give to each function the ordering of the matrix.
Beware that also the storage of matrices vary, a row-wise banded matrix is not stored the same as a column-wise band matrix.
I don't think there are mechanics to allow the user to choose between using BLAS or CUBLAS,
without writing the code twice.
CUBLAS also has on most function calls a "handle" variable that does not appear on BLAS.
I though of #define to change the name at each function call, but this might not work.