CUDA Library for Computing Kronecker Product

Question

I have an application that requires me to calculate some large Kronecker products of 2D matrices and multiply the result by large 2D matrices. I would like to implement this on a GPU in CUDA and would prefer to use a tuned library implementation rather than writing my own (certainly suboptimal) Kronecker product. I have experience with CUDA, BLAS, LAPACK, etc., but unfortunately there is no kron(A,B) function in the common GPU libraries (MAGMA, cuBLAS, CULA, etc.).

I've searched for solutions, but can't find a library that suits my needs. (The closest question on SO is "parallel Kronecker tensor product on gpu using CUDA", but that looks like a custom solution to a special case, which won't suit my needs. I'm looking for a Kronecker product that will work in the most general case.)

I have read that DGEMM in BLAS can be used to implement a Kronecker product. Is there a standard algorithm to implement a Kronecker product using DGEMM (or its single/complex variants)? It seems to me that the only way would be to call DGEMM in a loop and tile the results into a larger matrix, which does not seem very efficient. Or does anyone know of another implementation or paper that might provide what I'm looking for?

The dot product of two vectors is a scalar. The Kronecker product of two vectors is a matrix. How could you *possibly* use a dot product routine (ie reduction/summation) to compute the Kronecker product (element wise multiplication)? This makes absolutely no mathematical sense.... — talonmies, Jan 17 '14 at 16:39
GEMM performs matrix-matrix ops, including multiplication. I made no reference to dot-products in my question. According to http://arxiv.org/abs/1304.7054, "BLAS level-3 operation GEMM is used in practice" to effect Kronecker products. I'll be honest in that I don't 100% understand what the paper means by that statement, but that is the point of this question. As I stated above, I'm just trying to figure out exactly which options are available to me (even if GEMM is a dead end), and Googling this (like the comment above) has provided little help. — Michael Puglia, Jan 17 '14 at 17:00
GEMM *is* a dot product. A matrix-matrix dot product. What you probably want is a rank-1 update (something like BLAS `ger`), but a Kronecker product of a pair of n×n matrices would require n*n rank-1 updates to compute the full Kronecker product. — talonmies, Jan 17 '14 at 17:06

score 3 · Answer 1 · answered Jan 20 '14 at 13:49

The paper you have linked to is exploiting the following identity

[equation image not preserved]

to eliminate the need to calculate the Kronecker product explicitly, replacing it with a level 3 BLAS gemm call instead. If your problem is a matrix equation then you can use gemm in this way; otherwise it is of no use to you.
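As a minimal sketch of the gemm route, assume the identity meant is the standard one, (A ⊗ B) vec(X) = vec(A X B^T), with vec taken in row-major order (for column-major vec the equivalent form is (B^T ⊗ A) vec(X) = vec(A X B)). The C++ below is a CPU reference under that assumption, not library code: `matmul` is a hypothetical stand-in for a gemm call, and `kron_apply` and `kron_apply_reference` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // dense matrix, row-major storage

// C (m x n) = A (m x k) * B (k x n). Hypothetical stand-in for a gemm call.
Matrix matmul(const Matrix& A, const Matrix& B,
              std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    Matrix C(m * n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t t = 0; t < k; ++t)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + t] * B[t * n + j];
    return C;
}

// Transpose of an r x c row-major matrix (a real gemm takes a transpose flag).
Matrix transpose(const Matrix& M, std::size_t r, std::size_t c) {
    assert(M.size() == r * c);
    Matrix T(r * c);
    for (std::size_t i = 0; i < r; ++i)
        for (std::size_t j = 0; j < c; ++j)
            T[j * r + i] = M[i * c + j];
    return T;
}

// y = kron(A, B) * x without forming kron(A, B).
// A is m x n, B is p x q, x is an n x q matrix X flattened row-major.
// The result is the m x p matrix A * X * B^T, flattened row-major.
Matrix kron_apply(const Matrix& A, std::size_t m, std::size_t n,
                  const Matrix& B, std::size_t p, std::size_t q,
                  const Matrix& x) {
    Matrix AX = matmul(A, x, m, n, q);               // m x q
    return matmul(AX, transpose(B, p, q), m, q, p);  // m x p
}

// Same product straight from the definition: entry (i*p + k, j*q + l)
// of kron(A, B) is A(i, j) * B(k, l).
Matrix kron_apply_reference(const Matrix& A, std::size_t m, std::size_t n,
                            const Matrix& B, std::size_t p, std::size_t q,
                            const Matrix& x) {
    assert(A.size() == m * n && B.size() == p * q && x.size() == n * q);
    Matrix y(m * p, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t k = 0; k < p; ++k)
            for (std::size_t j = 0; j < n; ++j)
                for (std::size_t l = 0; l < q; ++l)
                    y[i * p + k] += A[i * n + j] * B[k * q + l] * x[j * q + l];
    return y;
}
```

The two gemm-style products only ever touch A, B and X, whereas the explicit Kronecker product would need storage for all m·p·n·q entries, which is the saving this route offers when the problem really is a matrix equation.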

The other identity which could potentially be useful is to calculate the Kronecker product using an outer product (a rank-1 update in level 2 BLAS, IIRC):

[equation image not preserved]

Note again that the ordering of the resulting matrix will not be the same as the Kronecker product of the matrices A and B.
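To make the ordering point concrete, here is a small CPU sketch, assuming the outer product in question is vec(A) vec(B)^T with row-major vec. `rank1_update` is a hypothetical stand-in for a BLAS `ger` call and `reorder_to_kron` is an illustrative helper. The outer product holds exactly the products A(i,j)·B(k,l), but as an (m·n) × (p·q) matrix, so an index permutation is still needed to obtain the (m·p) × (n·q) Kronecker layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // dense matrix, row-major storage

// R += alpha * a * b^T, the operation a BLAS ger call performs.
// R has a.size() rows and b.size() columns.
void rank1_update(Matrix& R, double alpha, const Matrix& a, const Matrix& b) {
    assert(R.size() == a.size() * b.size());
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < b.size(); ++c)
            R[r * b.size() + c] += alpha * a[r] * b[c];
}

// Permute R = vec(A) vec(B)^T (A is m x n, B is p x q) into kron(A, B).
// R(i*n + j, k*q + l) and K(i*p + k, j*q + l) both equal A(i, j) * B(k, l).
Matrix reorder_to_kron(const Matrix& R, std::size_t m, std::size_t n,
                       std::size_t p, std::size_t q) {
    assert(R.size() == m * n * p * q);
    Matrix K(m * p * n * q);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < p; ++k)
                for (std::size_t l = 0; l < q; ++l)
                    K[(i * p + k) * (n * q) + (j * q + l)] =
                        R[(i * n + j) * (p * q) + (k * q + l)];
    return K;
}
```

The permutation is a pure data shuffle with no arithmetic, so in this formulation the rank-1 update does the multiplications and the reordering pass costs an extra read and write of the full result.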

I am not aware of a CUDA library for computing the true Kronecker product of a pair of arbitrarily sized matrices. It should be a memory-bound problem (each output element costs one multiply and one store), so even a relatively naïve approach which coalesces loads and re-uses as much data as possible should get fairly close to peak bandwidth.
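A naïve version of that approach might look like the following. This is a CPU sketch in C++ rather than a tuned kernel; the function name `kron` and the row-major layout are assumptions for illustration. Each iteration of the loop computes one output element independently, so the loop body is what a single GPU thread would execute, with `idx` derived from the block and thread indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // dense matrix, row-major storage

// K = kron(A, B), where A is m x n and B is p x q; K is (m*p) x (n*q).
Matrix kron(const Matrix& A, std::size_t m, std::size_t n,
            const Matrix& B, std::size_t p, std::size_t q) {
    assert(A.size() == m * n && B.size() == p * q);
    const std::size_t cols = n * q;
    Matrix K(m * p * cols);
    // Iterations are independent: on a GPU, one thread per idx.
    for (std::size_t idx = 0; idx < K.size(); ++idx) {
        const std::size_t row = idx / cols;
        const std::size_t col = idx % cols;
        // Block (row / p, col / q) selects the entry of A;
        // the offset inside the block selects the entry of B.
        K[idx] = A[(row / p) * n + (col / q)] * B[(row % p) * q + (col % q)];
    }
    return K;
}
```

With this indexing, consecutive values of `idx` write consecutive output addresses and read consecutive entries of a row of B, while the entry of A stays fixed across each run of q elements, which is the access pattern coalescing wants.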
