
I'm doing matrix multiplication on a GTX1080 GPU using JCuda, version 0.8.0RC with CUDA 8.0. I load two matrices A and B into the device in row-major vector form, and read the product matrix from the device. But I'm finding that I run out of device memory earlier than I would expect. For example, if matrix A is dimensioned 100000 * 5000 = 500 million entries = 2GB worth of float values, then:

cuMemAlloc(MatrixA, 100000 * 5000 * Sizeof.FLOAT); 

works fine. But if I increase the number of rows to 110000 from 100000, I get the following error on this call (which is made before the memory allocations for matrices B and C, so those are not part of the problem):

Exception in thread "main" jcuda.CudaException: CUDA_ERROR_OUT_OF_MEMORY
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:344)
at jcuda.driver.JCudaDriver.cuMemAlloc(JCudaDriver.java:3714)
at JCudaMatrixMultiply.main(JCudaMatrixMultiply.java:84) (my code)

The issue is that allocating a matrix of this size on the device should take only about 2.2GB, and the GTX1080 has 8GB of memory, so I don't see why I'm running out of memory. It's true that I'm using JCuda 0.8.0RC with the release version of CUDA 8; I tried downloading the RC version of CUDA 8 (8.0.27) to use with JCuda 0.8.0RC, but had some problems getting it to work. If version compatibility is likely to be the issue, however, I can try again.

Matrices of 100000 * 5000 are pretty big, of course, and I won't need to work with larger matrices for a while on my neural network project, but I would like to be confident that I can use all 8GB of memory on this new card. Thanks for any help.

Nicholas Newell
  • If you allocate it as a whole chunk then it needs to be a contiguous chunk of memory, which is sometimes very hard to fulfill. Allocation is easier if you divide it into smaller chunks and allocate them separately, if that's possible. – user3528438 Nov 10 '16 at 01:51
  • Thank you - that's something I'll look into. – Nicholas Newell Nov 10 '16 at 03:00
  • 2
    Your numbers also suggest that you might run into an integer overflow issue somewhere in the library. Could you try to allocate 2^31 and 2^31-1 bytes to check? – havogt Nov 10 '16 at 16:16
  • Bravo! Array dimensions 131072 * 4096 (x4) = 2^31 bytes gives the error, while 131071 * 4096 (x4) < 2^31 bytes doesn't. Same for 65536 * 8192 and 65535 * 8192. So probably no way around this outside of partitioning then? – Nicholas Newell Nov 10 '16 at 20:52
  • @Nicholas Newell You are not on a 32-bit OS platform by any chance? Since CUDA itself uses `size_t` for allocation sizes, there should be no problem with allocation of this size on 64-bit platforms, and the problem would therefore appear to be internal to JCuda. Consider contacting the developer or vendor for help, or if it is open-source, check if you can adjust the sources to handle large allocations properly. – njuffa Nov 10 '16 at 21:29
  • @njuffa - Thank you - I'm running Win7 64-bit, so based on what you have said it does look like I should contact the JCuda people. – Nicholas Newell Nov 10 '16 at 21:44
  • 1
    Here I am :-) The last argument to `cuMemAlloc` is a `long`, which is internally simply cast to `size_t`, but the *computation* takes place using `int` - so the computation of `110000 * 5000 * Sizeof.FLOAT` will already overflow **on java side**. Try `(long)110000 ...` (or using the `L` suffix for the literals). If it still doesn't work, drop me a note. – Marco13 Nov 10 '16 at 23:20
  • Thank you! Casting the last argument to long did the trick. Now the sky's the limit! (Up to 8GB anyway) – Nicholas Newell Nov 11 '16 at 14:56

1 Answer


tl;dr:

When calling

cuMemAlloc(MatrixA, (long)110000 * 5000 * Sizeof.FLOAT); 
//                     ^ cast to long here

or alternatively

cuMemAlloc(MatrixA, 110000L * 5000 * Sizeof.FLOAT); 
//                        ^ use the "long" literal suffix here

it should work.


The last argument to cuMemAlloc is of type size_t. This is an implementation-specific unsigned integer type for "arbitrary" sizes. The closest possible primitive type in Java for this is long. And in general, every size_t in CUDA is mapped to long in JCuda. In this case, the Java long is passed as a jlong into the JNI layer, and this is simply cast to size_t for the actual native call.

(The lack of unsigned types in Java and the odd plethora of integer types in C can still cause problems. Sometimes, the C types and the Java types just don't match. But as long as the allocation is not larger than 9 Million Terabytes (!), a long should be fine here...)

But the comment by havogt led to the right track. What happens here is indeed an integer overflow: the computation of the actual value

110000 * 5000 * Sizeof.FLOAT = 2200000000

is by default done using the int type in Java, and this is where the overflow happens: 2200000000 is larger than Integer.MAX_VALUE, so the result wraps around to a negative value. When this negative value is cast to the (unsigned) size_t in the JNI layer, it becomes a ridiculously large positive value, which clearly causes the error.
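The wraparound is easy to reproduce in plain Java, independent of JCuda. This sketch only demonstrates the arithmetic involved:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // All three operands are int, so the multiplication is done
        // entirely in 32-bit arithmetic and silently wraps around.
        int wrapped = 110000 * 5000 * 4;
        System.out.println(wrapped);   // -2094967296 (negative!)

        // Widening the first operand to long makes the whole chain 64-bit.
        long correct = 110000L * 5000 * 4;
        System.out.println(correct);   // 2200000000
    }
}
```

Note that the Java compiler accepts the overflowing expression without complaint, which is what makes this bug easy to miss.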

When the computation is done using long values, either by explicitly casting to long or by appending the L suffix to one of the literals, the value is passed to CUDA as the proper long value of 2200000000.
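To guard against this class of bug more generally, the byte count can be computed with `Math.multiplyExact` (available since Java 8), which throws an `ArithmeticException` on overflow instead of silently wrapping. The helper name `floatBytes` below is just an illustration, not part of the JCuda API:

```java
public class SafeSize {
    // Hypothetical helper: computes rows * cols * sizeof(float) in 64-bit
    // arithmetic, throwing ArithmeticException on overflow instead of wrapping.
    static long floatBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), 4L);
    }

    public static void main(String[] args) {
        // The value that would be passed as the size argument to cuMemAlloc:
        System.out.println(floatBytes(110000, 5000)); // 2200000000
    }
}
```

The result can then be passed directly as the second argument of `cuMemAlloc`, which accepts a `long`.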

Marco13