Questions tagged [cublas]

The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library for use with CUDA capable GPUs.

The cuBLAS library is an implementation of the standard BLAS (Basic Linear Algebra Subprograms) API on top of the NVIDIA CUDA runtime.

Since the release of CUDA 4.0, the library has contained implementations of all 152 standard BLAS routines, supporting single-precision real and complex arithmetic on all CUDA-capable devices, and double-precision real and complex arithmetic on those CUDA-capable devices with double-precision support. The library includes host API bindings for C and Fortran, and CUDA 5.0 introduced a device API for use from within CUDA kernels.

The library is shipped in every version of the CUDA toolkit and has a dedicated homepage at http://developer.nvidia.com/cuda/cublas.
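Most questions under this tag follow the same host-API pattern: create a handle, move data with cublasSetVector/cublasGetVector, call a routine. A minimal sketch of that pattern using SAXPY; buffer names are illustrative:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 4;
        float h_x[] = {1, 2, 3, 4}, h_y[] = {10, 20, 30, 40};
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);
        // cublasSetVector copies host data into the device buffers.
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

        const float alpha = 2.0f;
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y

        cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
        for (int i = 0; i < n; ++i) printf("%g ", h_y[i]);  // 12 24 36 48
        printf("\n");

        cublasDestroy(handle);
        cudaFree(d_x); cudaFree(d_y);
        return 0;
    }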

330 questions
5 votes, 1 answer

CMake 3.11 Linking CUBLAS

How do I correctly link to CUBLAS in CMake 3.11? In particular, I'm trying to create a CMakeLists file for this code. CMakeLists file so far: cmake_minimum_required(VERSION 3.8 FATAL_ERROR) project(cmake_and_cuda LANGUAGES CXX…
Armin Meisterhirn • 801 • 1 • 13 • 26
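A commonly suggested sketch for this situation: CMake versions before 3.17 have no imported CUDA::cublas target, so one option is to locate libcublas in the directories the CUDA compiler already links against. Target and file names here are illustrative:

    cmake_minimum_required(VERSION 3.11 FATAL_ERROR)
    project(cmake_and_cuda LANGUAGES CXX CUDA)

    add_executable(app main.cu)
    # cuBLAS ships with the toolkit; search next to the CUDA runtime.
    find_library(CUBLAS_LIBRARY cublas
                 HINTS ${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES})
    target_link_libraries(app PRIVATE ${CUBLAS_LIBRARY})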
5 votes, 2 answers

Reducing matrix rows or columns in CUDA

I'm using CUDA with cuBLAS to perform matrix operations. I need to sum the rows (or columns) of a matrix. Currently I'm doing it by multiplying the matrix with a ones vector, but this doesn't seem so efficient. Is there any better way? Couldn't find…
Ran • 4,117 • 4 • 44 • 70
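The ones-vector product is in fact the standard cuBLAS answer here, expressed as a gemv rather than a full gemm. A minimal sketch for column sums of a column-major m x n matrix; device buffers d_A, d_ones (length m, filled with 1.0f), and d_sums (length n) are assumed allocated:

    #include <cublas_v2.h>

    void column_sums(cublasHandle_t h, const float* d_A, int m, int n,
                     const float* d_ones, float* d_sums)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // op(A) = A^T, so the result has n entries, one per column of A.
        cublasSgemv(h, CUBLAS_OP_T, m, n, &alpha, d_A, m,
                    d_ones, 1, &beta, d_sums, 1);
    }

Row sums are the same call with CUBLAS_OP_N and a length-n ones vector.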
5 votes, 4 answers

How to transpose a matrix in CUDA/cublas?

Say I have a matrix of dimension A*B on the GPU, where B (the number of columns) is the leading dimension, assuming C style. Is there any method in CUDA (or cublas) to transpose this matrix to FORTRAN style, where A (number of rows) becomes the…
Hailiang Zhang • 17,604 • 23 • 71 • 117
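A row-major (C-style) rows x cols matrix is bit-identical to a column-major (Fortran-style) cols x rows matrix, so the usual recipe is an out-of-place transpose with cublasgeam. A sketch with illustrative names; geam cannot transpose in place, so d_T must be a separate buffer:

    #include <cublas_v2.h>

    // T = 1 * M^T + 0 * (B operand ignored because beta == 0).
    void transpose(cublasHandle_t h, const float* d_M,
                   int rows, int cols, float* d_T)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgeam(h, CUBLAS_OP_T, CUBLAS_OP_N, cols, rows,
                    &alpha, d_M, rows, &beta, d_T, cols, d_T, cols);
    }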
5 votes, 1 answer

cuBLAS argmin -- segfault if outputting to device memory?

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int…
solvingPuzzles • 8,541 • 16 • 69 • 112
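The usual explanation for this symptom: cuBLAS defaults to host pointer mode, so a device address passed as the result argument is dereferenced on the host. A hedged sketch, assuming handle, d_x, and n are already set up:

    #include <cublas_v2.h>

    int* d_result;                       // device buffer for the 1-based index
    cudaMalloc(&d_result, sizeof(int));
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamin(handle, n, d_x, 1, d_result);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);  // restore default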
5 votes, 3 answers

Retaining dot product on GPGPU using CUBLAS routine

I am writing code to compute the dot product of two vectors using the cuBLAS dot-product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPGPU only. How can I make the value reside on the GPGPU only…
user1439690 • 659 • 1 • 11 • 26
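Same mechanism as the argmin question above: switching the handle to device pointer mode keeps the scalar result in GPU memory. A sketch with illustrative names (handle, d_x, d_y, n assumed):

    #include <cublas_v2.h>

    float* d_dot;                        // result stays on the GPU
    cudaMalloc(&d_dot, sizeof(float));
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasSdot(handle, n, d_x, 1, d_y, 1, d_dot);
    // d_dot can now feed later kernels or cuBLAS calls without a copy back.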
5 votes, 2 answers

Finding maximum and minimum with CUBLAS

I'm having problems grasping why my function that finds the maximum and minimum in a range of doubles using CUBLAS doesn't work properly. The code is as follows: void findMaxAndMinGPU(double* values, int* max_idx, int* min_idx, int n) { double*…
ssnielsen • 525 • 5 • 15
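Two classic pitfalls with code like this: cublasIdamax/cublasIdamin expect device-resident data (the signature above takes raw host arrays), and they return 1-based indices of the elements with the largest/smallest absolute value. A sketch assuming the data has already been copied to a device buffer d_values:

    #include <cublas_v2.h>

    void findMaxAndMinGPU(cublasHandle_t h, const double* d_values, int n,
                          int* max_idx, int* min_idx)
    {
        // Default host pointer mode: the indices land in host ints.
        cublasIdamax(h, n, d_values, 1, max_idx);
        cublasIdamin(h, n, d_values, 1, min_idx);
        --*max_idx; --*min_idx;          // convert to 0-based C indices
    }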
4 votes, 1 answer

cublas cublasZgemm() slower than expected

According to NVIDIA, cublasZgemm is 6x faster than Intel MKL. However, on my PC (i7 2600, NVIDIA GTX 560, 64-bit Linux), cublasZgemm is slightly slower than MKL. I use the numpy.dot() that comes with the Enthought Python distribution, which links numpy…
lucas peng • 43 • 3
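A frequent cause of results like this is what gets timed: numpy.dot measures an in-memory MKL call, while a naive cuBLAS benchmark includes host-device transfers and one-time initialization. A hedged event-timing sketch, assuming a handle and square n x n device matrices d_A, d_B, d_C:

    #include <cublas_v2.h>
    #include <cuComplex.h>

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    // Warm-up call absorbs cuBLAS initialization and kernel-load time.
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);
    cudaEventRecord(start);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // multiply time only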
4 votes, 1 answer

typecasting in CUDA and cuBLAS

I am writing a program in CUDA and I am trying to reduce the overhead of the data transfer. I use the cuBLAS library for matrix multiplications and I have to send 30,000,000 numbers, whose values range from 0 to 255. Right now I'm sending them as floats,…
STE • 656 • 3 • 8 • 18
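A common suggestion for this situation: transfer the values as bytes (a quarter of the PCIe traffic) and widen them to float on the device before the cuBLAS call. A sketch; buffer names are illustrative:

    // Widen 8-bit values to float on the GPU after a byte-sized copy.
    __global__ void u8_to_f32(const unsigned char* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = static_cast<float>(in[i]);
    }

    // Host side: copy n bytes instead of n floats, then convert.
    cudaMemcpy(d_bytes, h_bytes, n, cudaMemcpyHostToDevice);
    u8_to_f32<<<(n + 255) / 256, 256>>>(d_bytes, d_floats, n);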
4 votes, 1 answer

Is it possible to call cuBLAS or cuBLASLt functions from CUDA 10.1 kernels?

Concerning CUDA 10.1: I'm doing some calculations on geometric meshes, with a large number of independent calculations done per face of the mesh. I run a CUDA kernel which does the calculation for each face. The calculations involve some matrix…
4 votes, 3 answers

CUDA - Simple matrix addition/sum operation

This should be very simple, but I could not find an exhaustive answer: I need to perform A + B = C with matrices, where A and B are two matrices of unknown size (anywhere from 2x2 up to 20,000x20,000). Should I use CUBLAS with Sgemm…
Paul • 43 • 3
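Sgemm multiplies, so it is the wrong tool here; the element-wise sum maps onto cublasgeam with both scalars set to 1. A minimal sketch for column-major m x n device matrices (names illustrative):

    #include <cublas_v2.h>

    const float alpha = 1.0f, beta = 1.0f;
    // C = 1*A + 1*B, element-wise; no transposition needed.
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n,
                &alpha, d_A, m, &beta, d_B, m, d_C, m);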
4 votes, 1 answer

Strange cuBLAS gemm batched performance

I am noticing some strange performance of cublasSgemmStridedBatched, and I am looking for an explanation. The matrix size is fixed at 20x20. Here are some timings (only the multiply, no data transfer) for a few different batch sizes: batch = 100,…
qtqt • 51 • 5
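For reference, a hedged sketch of the call being timed, with the 20x20 size from the question; the strides advance each operand by one full 20*20 block, so the batch is laid out contiguously:

    #include <cublas_v2.h>

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              20, 20, 20,
                              &alpha, d_A, 20, 400,
                                      d_B, 20, 400,
                              &beta,  d_C, 20, 400, batch);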
4 votes, 1 answer

How to make multiple CUBLAS API calls (e.g. cublasDgemm) really execute concurrently in multiple cudaStreams

I want to make two CUBLAS calls (e.g. cublasDgemm) really execute concurrently in two cudaStreams. As we know, the CUBLAS API is asynchronous: level-3 routines like cublasDgemm don't block the host, which means the following code (in the default cudaStream)…
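The standard pattern is to bind a stream to the handle immediately before each call with cublasSetStream. A sketch; streams, buffers, and sizes are illustrative, and true overlap only appears when neither gemm saturates the GPU on its own:

    #include <cublas_v2.h>

    cublasSetStream(handle, stream1);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A1, m, d_B1, k, &beta, d_C1, m);
    cublasSetStream(handle, stream2);   // rebind before the second call
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A2, m, d_B2, k, &beta, d_C2, m);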
4 votes, 1 answer

Using cuBLAS-XT for large input size

This link says cuBLAS-XT routines provide out-of-core operation – the size of the operand data is only limited by system memory size, not by GPU on-board memory size. This means that as long as the input data can be stored in CPU memory and the size of the output…
starrr • 1,013 • 1 • 17 • 48
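Unlike plain cuBLAS, the XT routines take host pointers and tile the operands across the selected GPUs internally, which is what makes the out-of-core claim work. A minimal sketch, assuming host buffers h_A, h_B, h_C of suitable sizes:

    #include <cublasXt.h>

    cublasXtHandle_t xt;
    cublasXtCreate(&xt);
    int devices[1] = {0};                  // use GPU 0
    cublasXtDeviceSelect(xt, 1, devices);
    const float alpha = 1.0f, beta = 0.0f;
    // Host pointers: the library stages tiles onto the GPU by itself.
    cublasXtSgemm(xt, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, h_A, m, h_B, k, &beta, h_C, m);
    cublasXtDestroy(xt);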
4 votes, 1 answer

Profiling cublas applications

I'm trying to profile my application, which uses cuBLAS exclusively, with the NVIDIA Visual Profiler on Windows; however, it shows no GPU usage in my application at all! That is, the timeline is completely empty except for profiling overhead. …
Andrew • 867 • 7 • 20
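One frequent cause of an empty timeline is the process exiting before the driver flushes its profiling buffers. A commonly suggested sketch of the fix (the profiler-API calls are optional but help scope the capture):

    #include <cuda_profiler_api.h>

    cudaProfilerStart();
    // ... cuBLAS calls to be profiled ...
    cudaProfilerStop();
    cudaDeviceReset();   // flushes profiling data before process exit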
4 votes, 1 answer

CUBLAS: Incorrect inversion for matrix with zero pivot

Since CUDA 5.5, the CUBLAS library has contained routines for batched matrix factorization and inversion (cublas&lt;t&gt;getrfBatched and cublas&lt;t&gt;getriBatched, respectively). Taking guidance from the documentation, I wrote a test code for inversion of an N x N…
sgarizvi • 16,623 • 9 • 64 • 98
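The documented behavior behind this: getrfBatched reports a zero pivot through its info array, and feeding such a factorization to getriBatched yields an undefined inverse rather than an error. A sketch of the check for a batch of one n x n matrix; device arrays d_Aarray, d_Carray, d_pivots, d_info are assumed set up:

    #include <cublas_v2.h>

    int h_info = 0;
    cublasSgetrfBatched(handle, n, d_Aarray, n, d_pivots, d_info, 1);
    cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    if (h_info > 0) {
        // U(h_info, h_info) is exactly zero: the matrix is singular
        // and must not be passed on to getriBatched.
    } else {
        cublasSgetriBatched(handle, n, (const float**)d_Aarray, n,
                            d_pivots, d_Carray, n, d_info, 1);
    }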