
I'm new to OpenACC. I like it very much so far, since I'm already familiar with OpenMP.

I have two 1080Ti cards, each with 9 GB, and 128 GB of RAM. I'm trying a very basic test: allocate an array, initialize it, then sum it up in parallel. This works for 8 GB, but when I increase to 10 GB I get an out-of-memory error. My understanding was that with the unified memory of Pascal (which these cards are) and CUDA 8, I could allocate an array larger than the GPU's memory and the hardware would page in and page out on demand.

Here's my full C test code:

$ cat firstAcc.c 

#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 10

int main()
{
  float *a;
  size_t n = GB*1024*1024*1024/sizeof(float);
  size_t s = n * sizeof(float);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (int i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  float sum=0.0;
  #pragma acc loop reduction (+:sum)
  for (int i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}

As per the "Enable Unified Memory" section of this article, I compile it with:

$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo firstAcc.c
main:
 20, Loop not fused: function call before adjacent loop
     Generated vector simd code for the loop
 28, Loop not fused: function call before adjacent loop
     Generated vector simd code for the loop containing reductions
     Generated a prefetch instruction for the loop

I need to understand those messages, but for now I don't think they are relevant. Then I run it:

$ ./a.out
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted (core dumped)

This works fine if I change GB to 8. I expected 10 GB to work (despite the GPU card having 9 GB) thanks to the Pascal 1080Ti and CUDA 8.

Have I misunderstood, or what am I doing wrong? Thanks in advance.

$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

$ cat /usr/local/cuda-8.0/version.txt 
CUDA Version 8.0.61
Matt Dowle
    I'm suspicious of this: `size_t n = GB*1024*1024*1024/sizeof(float);` When I compile that with GNU g++ 4.8.2, I get a warning about integer overflow. What happens if you print out `n` and `s` immediately after you assign them? When I do that, I get numbers that are way too large. Try adding `ULL` after **all** of your constants. – Robert Crovella May 02 '17 at 20:29
  • @RobertCrovella How embarrassing. Yes it was that. Works now. Thank you! – Matt Dowle May 02 '17 at 20:38

2 Answers


Besides what Bob mentioned, I made a few more fixes.

First, you're not actually generating an OpenACC compute region, since you only have a "#pragma acc loop" directive. This should be "#pragma acc parallel loop". You can see this in the compiler feedback messages, which only show host-code optimizations.

Second, the "i" index should be declared as a "long". Otherwise, you'll overflow the index.

Finally, you need to add "cc60" to your target accelerator options to tell the compiler to target a Pascal based GPU.

% cat mi.c  
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
  float *a;
  size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (long i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  double sum=0.0;
  #pragma acc parallel loop reduction (+:sum)
  for (long i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}

% pgcc -fast -acc -ta=tesla:managed,cuda8.0,cc60 -Minfo=accel mi.c
main:
     21, Accelerator kernel generated
         Generating Tesla code
         21, Generating reduction(+:sum)
         22, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     21, Generating implicit copyin(a[:5368709120])
% ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
Mat Colgrove
  • Excellent - many thanks! I had read in `man pgcc` the sentence `Specifying cc60 also implies the cuda8.0 option` and I had wondered what `cc60` was. I will use `cc60` instead of `cuda8.0` then. Is there anything like `-mtune=native` that just detects the hardware for me? – Matt Dowle May 03 '17 at 00:16
  • Yes, you are correct that "cc60" (Compute Capability 6.0, aka Pascal) implies CUDA 8.0. On the host side, the compiler does auto-detect the CPU architecture if not explicitly set via the "-tp" flag. For the device, pgcc defaults to targeting Fermi, Kepler, and Maxwell i.e. "-ta=tesla:cc30,cc35,cc50" but does not auto-detect what device is on the system (often there isn't an attached device on the compilation system). We didn't want to add Pascal devices by default since many of our customers haven't moved to using CUDA 8.0 drivers. – Mat Colgrove May 03 '17 at 14:23
  • Ok makes sense. Thanks. – Matt Dowle May 03 '17 at 18:40
  • Any chance you know the answer to my next question here? The answer there is a bit muddled about unified memory in CUDA 6, but I'm asking specifically about the new, improved hardware page-fault abilities of Pascal and CUDA 8: http://stackoverflow.com/questions/43768717/speed-of-pascal-cuda8-1080ti-unified-memory#comment74582907_43770299 – Matt Dowle May 04 '17 at 02:35
  • Looks like other folks have jumped in with the current answer of calling cudaMemAdvise and cudaMemPrefetchAsync. At PGI, we are looking at ways to improve the user experience here. However, given this is still in the research phase, I'm unfortunately not able to give details other than that we're looking into it. – Mat Colgrove May 04 '17 at 14:45
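
For reference, here is a minimal sketch of the managed-memory hinting APIs named in the comment above (cudaMemAdvise and cudaMemPrefetchAsync, both introduced in CUDA 8); the allocation size and device number are illustrative and error checking is omitted:

#include <cuda_runtime.h>

int main(void)
{
  size_t s = 1ULL << 30;   /* 1 GB, illustrative */
  float *a;
  int dev = 0;

  /* managed (unified) allocation, roughly what -ta=tesla:managed arranges */
  cudaMallocManaged((void **)&a, s, cudaMemAttachGlobal);

  /* hint that the array should preferentially live on the GPU */
  cudaMemAdvise(a, s, cudaMemAdviseSetPreferredLocation, dev);

  /* migrate the pages to the GPU up front rather than relying on
     demand paging at first touch */
  cudaMemPrefetchAsync(a, s, dev, 0);
  cudaDeviceSynchronize();

  cudaFree(a);
  return 0;
}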

I believe a problem is here:

size_t n = GB*1024*1024*1024/sizeof(float);

When I compile that line of code with g++, I get a warning about integer overflow. For some reason the PGI compiler does not warn, but the same badness occurs under the hood. If I add a printout after the declarations of n and s, like this:

  size_t n = GB*1024*1024*1024/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);  // add this line

and compile with PGI 17.04 and run (on a P100, with 16 GB), I get output like this:

$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
     16, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
     22, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
$ ./a.out
n = 4611686017890516992, s = 18446744071562067968
malloc: call to cuMemAllocManaged returned error 2: Out of memory
Aborted
$

so it's evident that n and s are not what you intended.
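
To spell out the arithmetic: every constant in GB*1024*1024*1024 is an int, so the product is computed in 32-bit signed arithmetic. 10*2^30 = 10737418240 exceeds INT_MAX (2147483647); on a typical two's-complement machine it wraps to INT_MIN (-2147483648), which is then converted to size_t for the division by sizeof(float), yielding exactly the huge numbers above. Here is a minimal sketch reproducing just the overflow (signed overflow is formally undefined behavior; the wrapped values are simply what common compilers produce):

#include <stdio.h>

int main(void)
{
  /* all operands are int: 10 * 2^30 overflows and typically wraps to
     INT_MIN, which becomes a huge value when converted to size_t */
  size_t n_bad = 10 * 1024 * 1024 * 1024 / sizeof(float);

  /* with ULL suffixes the expression is evaluated as unsigned long long */
  size_t n_ok  = 10ULL * 1024ULL * 1024ULL * 1024ULL / sizeof(float);

  printf("n_bad = %zu\n", n_bad);   /* 4611686017890516992 */
  printf("n_ok  = %zu\n", n_ok);    /* 2684354560          */
  return 0;
}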

We can fix this by marking all of those constants with ULL, and then things seem to work correctly for me:

$ cat m1.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>

#define GB 20ULL

int main()
{
  float *a;
  size_t n = GB*1024ULL*1024ULL*1024ULL/sizeof(float);
  size_t s = n * sizeof(float);
  printf("n = %lu, s = %lu\n", n, s);
  a = (float *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  printf("Initializing ... ");
  for (int i = 0; i < n; ++i) {
    a[i] = 0.1f;
  }
  printf("done\n");
  double sum=0.0;
  #pragma acc loop reduction (+:sum)
  for (int i = 0; i < n; ++i) {
    sum+=a[i];
  }
  printf("Sum is %f\n", sum);
  free(a);
  return 0;
}
$ pgcc -acc -fast -ta=tesla:managed:cuda8 -Minfo m1.c
main:
     16, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop
     22, Loop not fused: function call before adjacent loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
$ ./a.out
n = 5368709120, s = 21474836480
Initializing ... done
Sum is 536870920.000000
$

Note that I've made another change above as well. I changed the sum accumulation variable from float to double. This is necessary to preserve somewhat "sensible" results when doing a very large reduction across very small quantities.
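
A small host-side sketch of the effect (illustrative numbers, not the OpenACC reduction itself): once a float accumulator reaches 2^21 = 2097152, half a unit in the last place is 0.125, so adding 0.1f rounds to no change at all and the sum stalls, while a double accumulator keeps absorbing the additions:

#include <stdio.h>

int main(void)
{
  float  fsum = 0.0f;
  double dsum = 0.0;
  /* 100 million additions of 0.1f; the true total is about 1e7 */
  for (long i = 0; i < 100000000L; ++i) {
    fsum += 0.1f;   /* stalls at 2097152, where 0.1 is below half a ULP */
    dsum += 0.1f;
  }
  printf("float  sum: %f\n", fsum);  /* about 2097152   */
  printf("double sum: %f\n", dsum);  /* about 10000000  */
  return 0;
}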

And, as @MatColgrove pointed out in his answer, I missed a few other things as well.

Robert Crovella