Questions tagged [cublas]

The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library for use with CUDA capable GPUs.

The cuBLAS library is an implementation of the standard BLAS (Basic Linear Algebra Subprograms) API on top of the NVIDIA CUDA runtime.

Since the release of CUDA 4.0, the library has contained implementations of all 152 standard BLAS routines, supporting single-precision real and complex arithmetic on all CUDA-capable devices, and double-precision real and complex arithmetic on those CUDA-capable devices with double-precision support. The library includes host API bindings for C and Fortran, and CUDA 5.0 introduced a device API for use from within CUDA kernels.

The library is shipped in every version of the CUDA toolkit and has a dedicated homepage at http://developer.nvidia.com/cuda/cublas.
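Most questions under this tag follow the same host-API pattern: create a handle, move data with cublasSetVector/cublasGetVector, call a routine. A minimal sketch of that pattern using SAXPY; buffer names are illustrative:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 4;
        float h_x[] = {1, 2, 3, 4}, h_y[] = {10, 20, 30, 40};
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMalloc(&d_y, n * sizeof(float));

        cublasHandle_t handle;
        cublasCreate(&handle);
        // cublasSetVector copies host data into the device buffers.
        cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

        const float alpha = 2.0f;
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y

        cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);
        for (int i = 0; i < n; ++i) printf("%g ", h_y[i]);  // 12 24 36 48
        printf("\n");

        cublasDestroy(handle);
        cudaFree(d_x); cudaFree(d_y);
        return 0;
    }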

330 questions
5 votes, 1 answer

CMake 3.11 Linking CUBLAS

How do I correctly link to CUBLAS in CMake 3.11? In particular, I'm trying to create a CMakeLists file for this code. CMakeLists file so far: cmake_minimum_required(VERSION 3.8 FATAL_ERROR) project(cmake_and_cuda LANGUAGES CXX…
Armin Meisterhirn • 801 • 1 • 13 • 26
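A commonly suggested sketch for this situation: CMake versions before 3.17 have no imported CUDA::cublas target, so one option is to locate libcublas in the directories the CUDA compiler already links against. Target and file names here are illustrative:

    cmake_minimum_required(VERSION 3.11 FATAL_ERROR)
    project(cmake_and_cuda LANGUAGES CXX CUDA)

    add_executable(app main.cu)
    # cuBLAS ships with the toolkit; search next to the CUDA runtime.
    find_library(CUBLAS_LIBRARY cublas
                 HINTS ${CMAKE_CUDA_IMPLICIT_LINK_DIRECTORIES})
    target_link_libraries(app PRIVATE ${CUBLAS_LIBRARY})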
5 votes, 2 answers

Reducing matrix rows or columns in CUDA

I'm using CUDA with cuBLAS to perform matrix operations. I need to sum the rows (or columns) of a matrix. Currently I'm doing it by multiplying the matrix with a ones vector, but this doesn't seem so efficient. Is there any better way? Couldn't find…
Ran • 4,117 • 4 • 44 • 70
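The ones-vector product is in fact the standard cuBLAS answer here, expressed as a gemv rather than a full gemm. A minimal sketch for column sums of a column-major m x n matrix; device buffers d_A, d_ones (length m, filled with 1.0f), and d_sums (length n) are assumed allocated:

    #include <cublas_v2.h>

    void column_sums(cublasHandle_t h, const float* d_A, int m, int n,
                     const float* d_ones, float* d_sums)
    {
        const float alpha = 1.0f, beta = 0.0f;
        // op(A) = A^T, so the result has n entries, one per column of A.
        cublasSgemv(h, CUBLAS_OP_T, m, n, &alpha, d_A, m,
                    d_ones, 1, &beta, d_sums, 1);
    }

Row sums are the same call with CUBLAS_OP_N and a length-n ones vector.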
5 votes, 4 answers

How to transpose a matrix in CUDA/cublas?

Say I have a matrix of dimension A*B on the GPU, where B (the number of columns) is the leading dimension, assuming C style. Is there any method in CUDA (or cublas) to transpose this matrix to FORTRAN style, where A (number of rows) becomes the…
Hailiang Zhang • 17,604 • 23 • 71 • 117
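A row-major (C-style) rows x cols matrix is bit-identical to a column-major (Fortran-style) cols x rows matrix, so the usual recipe is an out-of-place transpose with cublasgeam. A sketch with illustrative names; geam cannot transpose in place, so d_T must be a separate buffer:

    #include <cublas_v2.h>

    // T = 1 * M^T + 0 * (B operand ignored because beta == 0).
    void transpose(cublasHandle_t h, const float* d_M,
                   int rows, int cols, float* d_T)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgeam(h, CUBLAS_OP_T, CUBLAS_OP_N, cols, rows,
                    &alpha, d_M, rows, &beta, d_T, cols, d_T, cols);
    }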
5 votes, 1 answer

cuBLAS argmin -- segfault if outputting to device memory?

In cuBLAS, cublasIsamin() gives the argmin for a single-precision array. Here's the full function declaration: cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int…
solvingPuzzles • 8,541 • 16 • 69 • 112
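The usual explanation for this symptom: cuBLAS defaults to host pointer mode, so a device address passed as the result argument is dereferenced on the host. A hedged sketch, assuming handle, d_x, and n are already set up:

    #include <cublas_v2.h>

    int* d_result;                       // device buffer for the 1-based index
    cudaMalloc(&d_result, sizeof(int));
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamin(handle, n, d_x, 1, d_result);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);  // restore default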
5 votes, 3 answers

Retaining dot product on GPGPU using CUBLAS routine

I am writing code to compute the dot product of two vectors using the cuBLAS dot-product routine, but it returns the value in host memory. I want to use the dot product for further computation on the GPGPU only. How can I make the value reside on the GPGPU only…
user1439690 • 659 • 1 • 11 • 26
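Same mechanism as the argmin question above: switching the handle to device pointer mode keeps the scalar result in GPU memory. A sketch with illustrative names (handle, d_x, d_y, n assumed):

    #include <cublas_v2.h>

    float* d_dot;                        // result stays on the GPU
    cudaMalloc(&d_dot, sizeof(float));
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasSdot(handle, n, d_x, 1, d_y, 1, d_dot);
    // d_dot can now feed later kernels or cuBLAS calls without a copy back.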
5 votes, 2 answers

Finding maximum and minimum with CUBLAS

I'm having problems grasping why my function that finds the maximum and minimum in a range of doubles using CUBLAS doesn't work properly. The code is as follows: void findMaxAndMinGPU(double* values, int* max_idx, int* min_idx, int n) { double*…
ssnielsen • 525 • 5 • 15
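Two classic pitfalls with code like this: cublasIdamax/cublasIdamin expect device-resident data (the signature above takes raw host arrays), and they return 1-based indices of the elements with the largest/smallest absolute value. A sketch assuming the data has already been copied to a device buffer d_values:

    #include <cublas_v2.h>

    void findMaxAndMinGPU(cublasHandle_t h, const double* d_values, int n,
                          int* max_idx, int* min_idx)
    {
        // Default host pointer mode: the indices land in host ints.
        cublasIdamax(h, n, d_values, 1, max_idx);
        cublasIdamin(h, n, d_values, 1, min_idx);
        --*max_idx; --*min_idx;          // convert to 0-based C indices
    }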
4 votes, 1 answer

cublas cublasZgemm() slower than expected

According to NVIDIA, cublasZgemm is 6x faster than Intel MKL. However, on my PC (i7 2600, NVIDIA GTX 560, 64-bit Linux), cublasZgemm is slightly slower than MKL. I use the numpy.dot() that comes with the Enthought Python distribution, which links numpy…
lucas peng • 43 • 3
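A frequent cause of results like this is what gets timed: numpy.dot measures an in-memory MKL call, while a naive cuBLAS benchmark includes host-device transfers and one-time initialization. A hedged event-timing sketch, assuming a handle and square n x n device matrices d_A, d_B, d_C:

    #include <cublas_v2.h>
    #include <cuComplex.h>

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    // Warm-up call absorbs cuBLAS initialization and kernel-load time.
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);
    cudaEventRecord(start);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, d_A, n, d_B, n, &zero, d_C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // multiply time only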
4 votes, 1 answer

typecasting in CUDA and cuBLAS

I am writing a program in CUDA and I am trying to reduce the overhead of the data transfer. I use the cuBLAS library for matrix multiplications and I have to send 30,000,000 numbers, whose values range from 0 to 255. Right now I'm sending them as floats,…
STE • 656 • 3 • 8 • 18
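A common suggestion for this situation: transfer the values as bytes (a quarter of the PCIe traffic) and widen them to float on the device before the cuBLAS call. A sketch; buffer names are illustrative:

    // Widen 8-bit values to float on the GPU after a byte-sized copy.
    __global__ void u8_to_f32(const unsigned char* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = static_cast<float>(in[i]);
    }

    // Host side: copy n bytes instead of n floats, then convert.
    cudaMemcpy(d_bytes, h_bytes, n, cudaMemcpyHostToDevice);
    u8_to_f32<<<(n + 255) / 256, 256>>>(d_bytes, d_floats, n);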
4 votes, 1 answer

Is it possible to call cuBLAS or cuBLASLt functions from CUDA 10.1 kernels?

Concerning CUDA 10.1: I'm doing some calculations on geometric meshes, with a large number of independent calculations done per face of the mesh. I run a CUDA kernel which does the calculation for each face. The calculations involve some matrix…
4 votes, 3 answers

CUDA - Simple matrix addition/sum operation

This should be very simple, but I could not find an exhaustive answer: I need to perform A + B = C with matrices, where A and B are two matrices of unknown size (anywhere from 2x2 up to 20,000x20,000). Should I use CUBLAS with Sgemm…
Paul • 43 • 3
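Sgemm multiplies, so it is the wrong tool here; the element-wise sum maps onto cublasgeam with both scalars set to 1. A minimal sketch for column-major m x n device matrices (names illustrative):

    #include <cublas_v2.h>

    const float alpha = 1.0f, beta = 1.0f;
    // C = 1*A + 1*B, element-wise; no transposition needed.
    cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n,
                &alpha, d_A, m, &beta, d_B, m, d_C, m);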
4 votes, 1 answer

Strange cuBLAS gemm batched performance

I am noticing some strange performance of cublasSgemmStridedBatched, and I am looking for an explanation. The matrix size is fixed at 20x20. Here are some timings (only the multiply, no data transfer) for a few different batch sizes: batch = 100,…
qtqt • 51 • 5
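For reference, a hedged sketch of the call being timed, with the 20x20 size from the question; the strides advance each operand by one full 20*20 block, so the batch is laid out contiguously:

    #include <cublas_v2.h>

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              20, 20, 20,
                              &alpha, d_A, 20, 400,
                                      d_B, 20, 400,
                              &beta,  d_C, 20, 400, batch);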
4 votes, 1 answer

How to make multiple CUBLAS API calls (e.g. cublasDgemm) really execute concurrently in multiple cudaStreams

I want to make two CUBLAS calls (e.g. cublasDgemm) really execute concurrently in two cudaStreams. As we know, the CUBLAS API is asynchronous: level-3 routines like cublasDgemm don't block the host, which means the following code (in the default cudaStream)…
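The standard pattern is to bind a stream to the handle immediately before each call with cublasSetStream. A sketch; streams, buffers, and sizes are illustrative, and true overlap only appears when neither gemm saturates the GPU on its own:

    #include <cublas_v2.h>

    cublasSetStream(handle, stream1);
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A1, m, d_B1, k, &beta, d_C1, m);
    cublasSetStream(handle, stream2);   // rebind before the second call
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A2, m, d_B2, k, &beta, d_C2, m);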
4 votes, 1 answer

Using cuBLAS-XT for large input size

This link says cuBLAS-XT routines provide out-of-core operation – the size of the operand data is only limited by system memory size, not by GPU on-board memory size. This means that as long as the input data can be stored in CPU memory and the size of the output…
starrr • 1,013 • 1 • 17 • 48
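Unlike plain cuBLAS, the XT routines take host pointers and tile the operands across the selected GPUs internally, which is what makes the out-of-core claim work. A minimal sketch, assuming host buffers h_A, h_B, h_C of suitable sizes:

    #include <cublasXt.h>

    cublasXtHandle_t xt;
    cublasXtCreate(&xt);
    int devices[1] = {0};                  // use GPU 0
    cublasXtDeviceSelect(xt, 1, devices);
    const float alpha = 1.0f, beta = 0.0f;
    // Host pointers: the library stages tiles onto the GPU by itself.
    cublasXtSgemm(xt, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, h_A, m, h_B, k, &beta, h_C, m);
    cublasXtDestroy(xt);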
4 votes, 1 answer

Profiling cublas applications

I'm trying to profile my application, which uses cuBLAS exclusively, with the NVIDIA Visual Profiler on Windows; however, it shows no GPU usage in my application at all! That is, the timeline is completely empty except for profiling overhead. …
Andrew • 867 • 7 • 20
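One frequent cause of an empty timeline is the process exiting before the driver flushes its profiling buffers. A commonly suggested sketch of the fix (the profiler-API calls are optional but help scope the capture):

    #include <cuda_profiler_api.h>

    cudaProfilerStart();
    // ... cuBLAS calls to be profiled ...
    cudaProfilerStop();
    cudaDeviceReset();   // flushes profiling data before process exit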
4 votes, 1 answer

CUBLAS: Incorrect inversion for matrix with zero pivot

Since CUDA 5.5, the CUBLAS library has contained routines for batched matrix factorization and inversion (cublas&lt;t&gt;getrfBatched and cublas&lt;t&gt;getriBatched, respectively). Taking guidance from the documentation, I wrote a test code for inversion of an N x N…
sgarizvi • 16,623 • 9 • 64 • 98
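The documented behavior behind this: getrfBatched reports a zero pivot through its info array, and feeding such a factorization to getriBatched yields an undefined inverse rather than an error. A sketch of the check for a batch of one n x n matrix; device arrays d_Aarray, d_Carray, d_pivots, d_info are assumed set up:

    #include <cublas_v2.h>

    int h_info = 0;
    cublasSgetrfBatched(handle, n, d_Aarray, n, d_pivots, d_info, 1);
    cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    if (h_info > 0) {
        // U(h_info, h_info) is exactly zero: the matrix is singular
        // and must not be passed on to getriBatched.
    } else {
        cublasSgetriBatched(handle, n, (const float**)d_Aarray, n,
                            d_pivots, d_Carray, n, d_info, 1);
    }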