Questions tagged [cublas]

The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library for use with CUDA-capable GPUs.

The cuBLAS library is an implementation of the standard BLAS (Basic Linear Algebra Subprograms) API on top of the NVIDIA CUDA runtime.

As of CUDA 4.0, the library contains implementations of all 152 standard BLAS routines, supporting single-precision real and complex arithmetic on all CUDA-capable devices, and double-precision real and complex arithmetic on those CUDA-capable devices with double-precision support. The library includes host API bindings for C and Fortran, and CUDA 5.0 introduced a device API for use within CUDA kernels; a minimal example of the host API follows below.

The library ships with every version of the CUDA toolkit and has a dedicated homepage at http://developer.nvidia.com/cuda/cublas.
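
For orientation, here is a minimal sketch of the host API in use: create a handle, move data to the device, call a routine (SAXPY here), and copy the result back. Error checking is omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const int n = 4;
        float hx[n] = {1, 2, 3, 4}, hy[n] = {0, 0, 0, 0};
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);                         // every v2 call takes a handle
        const float alpha = 2.0f;
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);  // y = alpha*x + y

        cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < n; ++i) printf("%g\n", hy[i]);  // prints 2 4 6 8

        cublasDestroy(handle);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }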

330 questions
-1
votes
1 answer

Computing A.transpose*A in CUDA

There are some problems when I'm computing `A.transpose*A` in CUDA. Suppose A is an M*N matrix stored in column-major order, and I try to use cublasSgemm_v2, the matrix-matrix multiplication API in cuBLAS, like this… (a possible call shape is sketched below)
Zziggurats
  • 165
  • 1
  • 4
  • 12
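
A plausible shape for that call, sketched under the question's assumptions (A is M x N, column-major, already on the device): the product A^T * A is N x N, so pass CUBLAS_OP_T for the first operand and CUBLAS_OP_N for the second, with both leading dimensions taken from A as stored.

    #include <cublas_v2.h>

    // C = A^T * A for a column-major M x N device matrix dA.
    // dC must hold N*N floats on the device. Sketch only; no error checks.
    void at_times_a(cublasHandle_t handle, const float *dA, float *dC, int M, int N)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // The result is N x N and the contracted dimension is M;
        // lda/ldb describe A as stored (M x N), so both are M.
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    N, N, M,
                    &alpha, dA, M, dA, M,
                    &beta, dC, N);
    }

Since the result is symmetric, cublasSsyrk with CUBLAS_OP_T is an alternative that computes only one triangle.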
-1
votes
1 answer

How can I compute the row-to-all-rows distance matrix between two matrices W and X in Thrust or cuBLAS?

I have the following MATLAB code: tempx = full(sum(X.^2, 2)); tempc = full(sum(C.^2, 2).'); D = -2*(X * C.'); D = bsxfun(@plus, D, tempx); D = bsxfun(@plus, D, tempc); where X is an nxm matrix and W is a kxm matrix, respectively. One is the data and the other is… (a cuBLAS sketch follows below)
erogol
  • 13,156
  • 33
  • 101
  • 155
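
One way to mirror that MATLAB snippet in cuBLAS, sketched under stated assumptions (X is nxm and C is kxm, both column-major on the device; dTempx and dTempc already hold the squared row norms; all names are hypothetical): one GEMM for -2*X*C' and a small kernel for the two bsxfun additions.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Adds tempx[i] + tempc[j] to every D(i,j); D is n x k, column-major.
    __global__ void add_norms(float *D, const float *tempx, const float *tempc,
                              int n, int k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // row
        int j = blockIdx.y * blockDim.y + threadIdx.y; // column
        if (i < n && j < k)
            D[i + j * n] += tempx[i] + tempc[j];
    }

    // D = -2 * X * C^T, then D(i,j) += ||x_i||^2 + ||c_j||^2
    void pairwise_sqdist(cublasHandle_t handle, const float *dX, const float *dC,
                         const float *dTempx, const float *dTempc, float *dD,
                         int n, int k, int m)
    {
        const float alpha = -2.0f, beta = 0.0f;
        // X is n x m, C is k x m; X * C^T is n x k.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                    n, k, m, &alpha, dX, n, dC, k, &beta, dD, n);
        dim3 block(16, 16), grid((n + 15) / 16, (k + 15) / 16);
        add_norms<<<grid, block>>>(dD, dTempx, dTempc, n, k);
    }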
-1
votes
1 answer

Special Case of Matrix multiplication Using CUDA

I am searching for some special functions (CUDA) dedicated to typical dense matrix multiplications, e.g. A*B, where the size of A is 6*n, the size of B is n*6, and n is very large (n=2^24). I have utilized cuBLAS and some other libraries to test… (the plain cuBLAS call for this shape is sketched below)
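
For reference, the plain cuBLAS call for this shape is below (a sketch; A is 6 x n and B is n x 6, column-major on the device). With m = n = 6 and a huge k, the problem is a long reduction with a tiny output, so a single GEMM may not saturate the GPU; splitting k into batches and summing partial results is a common workaround.

    #include <cublas_v2.h>

    // C (6 x 6) = A (6 x n) * B (n x 6), all column-major on the device.
    // The output is tiny and k is huge, so this shape is reduction-bound.
    void small_outer_gemm(cublasHandle_t handle, const float *dA, const float *dB,
                          float *dC, int n)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    6, 6, n,
                    &alpha, dA, 6, dB, n,
                    &beta, dC, 6);
    }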
-1
votes
1 answer

Can you use cublasDdot() to perform BLAS operations on non-GPU memory?

So I have code that performs matrix multiplication, but the problem is that it returns just zeroes when I use the -lcublas library and the nvcc compiler; however, the code runs great, with just a few tweaks to function names, when I use the g++ compiler… (a device-memory sketch follows below)
Mechy
  • 259
  • 1
  • 4
  • 14
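
A frequent cause of all-zero results is passing host pointers where cuBLAS expects device pointers. cublasDdot requires x and y in device memory; with the default pointer mode the scalar result is returned to a host address. A minimal sketch:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const int n = 3;
        double hx[n] = {1, 2, 3}, hy[n] = {4, 5, 6};
        double *dx, *dy;
        cudaMalloc(&dx, n * sizeof(double));
        cudaMalloc(&dy, n * sizeof(double));
        cudaMemcpy(dx, hx, n * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(double), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        double result = 0.0;
        // x and y must be device pointers; with the default pointer mode
        // (CUBLAS_POINTER_MODE_HOST) the scalar lands in host memory.
        cublasDdot(handle, n, dx, 1, dy, 1, &result);
        printf("dot = %f\n", result);  // prints 32

        cublasDestroy(handle);
        cudaFree(dx);
        cudaFree(dy);
        return 0;
    }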
-2
votes
1 answer

cublasLt cublasLtMatmulAlgoGetHeuristic returns CUBLAS_STATUS_INVALID_VALUE for row-major matrices

I've just finished refactoring my program to use the cublasLt library for GEMM, and I fell into a CUBLAS_STATUS_INVALID_VALUE when executing cublasLtMatmulAlgoGetHeuristic in the function below. CudaMatrix.cu:product /** * Performs the matrix-matrix… (a row-major layout sketch follows below)
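
cublasLt does accept row-major operands, but the order has to be set explicitly on each matrix layout; a mismatch between the layout attributes and the actual data is one common source of CUBLAS_STATUS_INVALID_VALUE. A hedged sketch of creating a row-major FP32 layout (not the asker's code):

    #include <cublasLt.h>

    // Creates a row-major FP32 layout for a rows x cols matrix.
    // Sketch only; check every returned status in real code.
    cublasStatus_t make_row_major_layout(cublasLtMatrixLayout_t *layout,
                                         int rows, int cols)
    {
        // For row-major order, the leading dimension is the number of columns.
        cublasStatus_t st = cublasLtMatrixLayoutCreate(layout, CUDA_R_32F,
                                                       rows, cols, cols);
        if (st != CUBLAS_STATUS_SUCCESS) return st;
        cublasLtOrder_t order = CUBLASLT_ORDER_ROW;
        return cublasLtMatrixLayoutSetAttribute(*layout,
                                                CUBLASLT_MATRIX_LAYOUT_ORDER,
                                                &order, sizeof(order));
    }

Note that the heuristic can still reject particular combinations of orders, types, and epilogues, so the same status can also come from the cublasLtMatmulDesc configuration rather than the layouts.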
-2
votes
1 answer

Optimize vector-matrix multiplication in CUDA with a large number of zeros

I am using the following kernel to optimize vector-matrix multiplication for the case where both the vector and the matrix have a large number of zeros. Using this kernel may reduce the time taken for such a multiplication by up to half of the… (the zero-skipping idea is sketched below)
malang
  • 33
  • 9
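
The zero-skipping idea, as a minimal sketch (column-major matrix M, one thread per output row; all names hypothetical): test each vector entry once and skip the multiply-accumulate when it is zero. For genuinely sparse data, cuSPARSE is usually the better tool.

    // y = M * x, skipping columns where x[j] == 0. M is rows x cols,
    // column-major, all pointers in device memory.
    __global__ void spmv_skip_zeros(const float *M, const float *x, float *y,
                                    int rows, int cols)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        float sum = 0.0f;
        for (int j = 0; j < cols; ++j) {
            float xj = x[j];
            if (xj != 0.0f)                     // skip zero vector entries
                sum += M[row + j * rows] * xj;  // column-major indexing
        }
        y[row] = sum;
    }

Launched with, e.g., spmv_skip_zeros<<<(rows + 255) / 256, 256>>>(dM, dx, dy, rows, cols). Because every thread reads the same x[j], the branch is warp-uniform and costs little.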
-2
votes
1 answer

Impact of matrix sparsity on cblas sgemm in Ubuntu 14.04

I have recently discovered that the performance of a cblas_sgemm call for matrix multiplication dramatically improves if the matrices have a "large" number of zeros in them. It improves to the point that it beats its cublas cousin by around 100…
malang
  • 33
  • 9
-2
votes
1 answer

How to call an existing host function from a device function in CUDA

I have seen a similar question here. However, I could not get an exact answer from it, and it was written in 2012. I am trying to call the cublasStatus_t cublasSgbmv(...) function, which is defined in "cublas_v2.h", from a __global__ function. However, I… (a host-side restructuring is sketched below)
balik
  • 33
  • 10
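
cuBLAS host API functions such as cublasSgbmv cannot be called from __global__ code; they are host functions. (A separate device-side cuBLAS API existed for a time under dynamic parallelism, but it was later removed.) The usual restructuring is to return control to the host between kernels and make the call there, as in this sketch:

    #include <cublas_v2.h>

    // y = alpha * A * x + beta * y for a banded m x n matrix A with
    // kl sub-diagonals and ku super-diagonals, stored in cuBLAS banded
    // format with leading dimension lda >= kl + ku + 1.
    void banded_mv(cublasHandle_t handle, int m, int n, int kl, int ku,
                   const float *dA, int lda, const float *dx, float *dy)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgbmv(handle, CUBLAS_OP_N, m, n, kl, ku,
                    &alpha, dA, lda, dx, 1, &beta, dy, 1);
    }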
-2
votes
1 answer

CUDA Library for Computing Kronecker Product

I have an application that requires me to calculate some large Kronecker products of 2D matrices and multiply the result by large 2D matrices. I would like to implement this on a GPU in CUDA and would prefer to use a tuned library implementation… (a simple kernel sketch follows below)
Michael Puglia
  • 145
  • 2
  • 9
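
cuBLAS itself has no Kronecker-product routine; one simple approach is a custom kernel that exploits the closed form kron(A,B)(i*p+k, j*q+l) = A(i,j)*B(k,l), then hands the result to cublasSgemm for the subsequent multiplications. A minimal column-major sketch:

    // K = kron(A, B): A is m x n, B is p x q, K is (m*p) x (n*q),
    // all column-major in device memory.
    __global__ void kron_kernel(const float *A, const float *B, float *K,
                                int m, int n, int p, int q)
    {
        int r = blockIdx.x * blockDim.x + threadIdx.x;  // row of K
        int c = blockIdx.y * blockDim.y + threadIdx.y;  // column of K
        if (r >= m * p || c >= n * q) return;
        int i = r / p, k = r % p;  // block row / row within B
        int j = c / q, l = c % q;  // block column / column within B
        K[r + (size_t)c * (m * p)] = A[i + j * m] * B[k + l * p];
    }

Launched over a 2D grid covering the (m*p) x (n*q) output, e.g. with 16 x 16 thread blocks.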
-3
votes
1 answer

Use threads for cuBLAS calls from a kernel?

BEFORE reading below!: As I understand it, when you call cuBLAS from a kernel: cuBLAS calls are kernels themselves; the threads and blocks are managed by the cuBLAS calls; a cuBLAS call is launched by 1 thread (and 1 block) and then it…
George
  • 5,808
  • 15
  • 83
  • 160
-3
votes
1 answer

zero value from cublas function

I am trying to solve the Ax=b system using the conjugate gradient method. I am using the example from the NVIDIA samples, but instead of using the cusparseScsrmv function, I am using cublasSgemv to perform Ax. My problem is that the "dot"… (the two calls are sketched below)
George
  • 5,808
  • 15
  • 83
  • 160
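
For context, the two calls in question typically look like the sketch below (n x n column-major A on the device; names hypothetical). With the default CUBLAS_POINTER_MODE_HOST, cublasSdot blocks and writes the scalar to host memory, so a zero result usually points at the inputs (for example, a gemv that never ran or wrong leading dimensions) rather than at the dot call itself.

    #include <cublas_v2.h>

    // One matrix-vector product and one dot product from a CG iteration:
    // q = A * p, then pq = p . q. A is n x n, column-major, on the device.
    void cg_step(cublasHandle_t handle, const float *dA, const float *dp,
                 float *dq, float *pq, int n)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha, dA, n,
                    dp, 1, &beta, dq, 1);
        // With CUBLAS_POINTER_MODE_HOST (the default), the scalar result
        // lands in host memory and the call blocks until it is ready.
        cublasSdot(handle, n, dp, 1, dq, 1, pq);
    }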
-3
votes
1 answer

Programming bsxfun in CUDA via cuBLAS or Thrust?

I have a vector V that has nx1 items and a matrix M that has nxm items. I want to sum V with all the columns of M in CUDA. Is there any method in Thrust or cuBLAS that can help me solve the problem? (a one-call cuBLAS sketch follows below)
erogol
  • 13,156
  • 33
  • 101
  • 155
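
One cuBLAS-only answer: adding V to every column of M is the rank-1 update M += V * ones^T, which cublasSger performs in a single call. A sketch, assuming the caller prepares a device vector of m ones (dOnes is a hypothetical helper buffer):

    #include <cublas_v2.h>

    // M (n x m, column-major, device memory) += V * ones^T,
    // i.e. add the n-vector V to every column of M.
    void add_vector_to_columns(cublasHandle_t handle, float *dM, const float *dV,
                               const float *dOnes, int n, int m)
    {
        const float alpha = 1.0f;
        // Sger: A = alpha * x * y^T + A, with A of size n x m and lda = n.
        cublasSger(handle, n, m, &alpha, dV, 1, dOnes, 1, dM, n);
    }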
-4
votes
1 answer

How to create a Fortran interface for a void** ptr in C code

I am new to Fortran, and for a C function like the one below: cudaError_t cudaMalloc (void** devPtr, size_t size) Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is…
-5
votes
1 answer

cuda gemm transpose with numpy

I am wondering how the GEMM transpose works. I have a matrix which I want to multiply by the same matrix transposed, such as A.T * A. I have something like this: def bptrs(a): return…
NinjaGaiden
  • 3,046
  • 6
  • 28
  • 49
-5
votes
1 answer

cuda runtime api and dynamic kernel definition

Using the driver API precludes the use of the runtime API in the same application ([1]). Unfortunately cuBLAS, cuFFT, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cuBLAS at the same time,…
melisgl
  • 308
  • 2
  • 13