Thanks to the answers here yesterday, I think I now have a correct basic test of unified memory on a Pascal GTX 1080 Ti. It allocates a 50 GB one-dimensional array and adds it up on the GPU. If I understand correctly, the test should be memory bound, since the kernel does nothing but add integers. However, it takes about 24 seconds, which works out to roughly 2 GB/s. When I run the CUDA 8 bandwidthTest I see higher rates: 11.7 GB/s pinned and 8.5 GB/s pageable.
Is there any way to get the test to run faster than 24 seconds?
Here's the full test code:
$ cat firstAcc.c
#include <stdio.h>
#include <openacc.h>
#include <stdlib.h>
#include <time.h>

#define GB 50

/* Wall-clock time in seconds. */
static double wallclock()
{
  double ans = 0;
  struct timespec tp;
  if (0 == clock_gettime(CLOCK_REALTIME, &tp))
    ans = (double) tp.tv_sec + 1e-9 * (double) tp.tv_nsec;
  return ans;
}

int main()
{
  int *a;
  size_t n = (size_t)GB * 1024 * 1024 * 1024 / sizeof(int);
  size_t s = n * sizeof(int);
  printf("n = %lu, GB = %.3f\n", n, (double)s / (1024 * 1024 * 1024));
  a = (int *)malloc(s);
  if (!a) { printf("Failed to malloc.\n"); return 1; }
  setbuf(stdout, NULL);

  /* Initialize with a repeating -3..+3 pattern so full groups of 7 sum to zero. */
  double t0 = wallclock();
  printf("Initializing ... ");
  for (long i = 0; i < n; ++i) {
    a[i] = i % 7 - 3;
  }
  double t1 = wallclock();
  printf("done in %f (single CPU thread)\n", t1 - t0);
  t0 = t1;

  /* Sum on the GPU; the 50 GB managed array migrates on demand. */
  int sum = 0;
  #pragma acc parallel loop reduction(+:sum)
  for (long i = 0; i < n; ++i) {
    sum += a[i];
  }
  t1 = wallclock();
  printf("Sum is %d and it took %f\n", sum, t1 - t0);

  free(a);
  return 0;
}
I compile it as follows:
$ pgcc -fast -acc -ta=tesla:managed:cc60 -Minfo=accel firstAcc.c
main:
     40, Accelerator kernel generated
         Generating Tesla code
         40, Generating reduction(+:sum)
         41, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     40, Generating implicit copyin(a[:13421772800])
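As I understand it, that implicit copyin is what I'd get by spelling the data clause out myself; a sketch on my part (with -ta=tesla:managed I'd expect the clause to be effectively redundant, since the runtime migrates the managed allocation on demand; without managed, an up-front 50 GB copyin obviously couldn't fit in the card's 11 GB):

  #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
  for (long i = 0; i < n; ++i) {
    sum += a[i];
  }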
Then I run it twice:
$ ./a.out
n = 13421772800, GB = 50.000
Initializing ... done in 36.082607 (single CPU thread)
Sum is -5 and it took 23.902612
$ ./a.out
n = 13421772800, GB = 50.000
Initializing ... done in 36.001578 (single CPU thread)
Sum is -5 and it took 24.180615
The result (-5) is correct because I set up the data that way: the array is a repeating sequence of the 7 integers -3 through +3, so every full group of 7 sums to zero. Since n mod 7 = 2, only the last two elements (-3 and -2) contribute, giving -5.
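As a sanity check of that arithmetic, here is a small stand-alone snippet (the expected_sum helper is just a name of mine for illustration):

#include <stdio.h>

/* Each full group of 7 values (-3..+3) sums to zero, so only the
   n % 7 leftover elements at the end contribute to the total. */
static int expected_sum(size_t n)
{
  int sum = 0;
  for (size_t i = n - n % 7; i < n; ++i)
    sum += (int)(i % 7) - 3;
  return sum;
}

int main(void)
{
  size_t n = (size_t)50 * 1024 * 1024 * 1024 / sizeof(int);
  printf("n %% 7 = %zu, expected sum = %d\n", n % 7, expected_sum(n));
  /* Prints: n % 7 = 2, expected sum = -5 */
  return 0;
}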
The bandwidthTest (CUDA 8 samples/1_Utilities) result for pageable memory is:
$ ./bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1080 Ti
Quick Mode
 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     8576.7

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11474.3

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     345412.1

Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
I see that note. But what should I use instead? Do these measurements seem in the right ballpark?
Is there anything that can be done to make the test run in something closer to 6 seconds (50 GB / 8.5 GB/s) rather than ~24 s?
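One thing I wondered about is pinning the host buffer myself, since pinned transfers benchmark at 11.7 GB/s here. A rough, untested sketch of what I mean (it mixes the CUDA runtime into the OpenACC program, which I assume needs linking against the CUDA runtime; I also don't know how a cudaMallocHost buffer interacts with -ta=tesla:managed, or whether page-locking 50 GB of host RAM is even reasonable):

#include <cuda_runtime.h>

/* Sketch: replace malloc(s) with a pinned (page-locked) host
   allocation so host-to-device transfers are eligible for the
   pinned rate instead of the pageable rate. */
int *a = NULL;
if (cudaMallocHost((void **)&a, s) != cudaSuccess) {
  printf("Failed to cudaMallocHost.\n");
  return 1;
}
/* ... initialize and run the OpenACC sum exactly as before ... */
cudaFreeHost(a);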
The result with --mode=shmoo actually shows pageable transfers reaching a higher rate of about 11 GB/s:
$ ./bandwidthTest --memory=pageable --mode=shmoo
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 1080 Ti
Shmoo Mode
.................................................................................
Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
1024 160.3
2048 302.1
3072 439.2
4096 538.4
5120 604.6
6144 765.3
7168 875.0
8192 979.2
9216 1187.3
10240 1270.6
11264 1335.0
12288 1449.3
13312 1579.6
14336 1622.2
15360 1836.0
16384 1995.0
17408 2133.0
18432 2189.8
19456 2289.2
20480 2369.7
22528 2525.8
24576 2625.8
26624 2766.0
28672 2614.4
30720 2895.8
32768 3050.5
34816 3151.1
36864 3263.8
38912 3339.2
40960 3395.6
43008 3488.4
45056 3557.0
47104 3642.1
49152 3658.5
51200 3736.9
61440 4040.4
71680 4076.9
81920 4310.3
92160 4522.6
102400 4668.5
204800 5461.5
307200 5820.7
409600 6003.3
512000 6153.8
614400 6232.5
716800 6285.9
819200 6368.9
921600 6409.3
1024000 6442.5
1126400 6572.3
2174976 8239.3
3223552 9041.6
4272128 9524.2
5320704 9824.5
6369280 10065.2
7417856 10221.2
8466432 10355.7
9515008 10452.8
10563584 10553.9
11612160 10613.1
12660736 10680.3
13709312 10728.1
14757888 10763.8
15806464 10804.4
16855040 10838.1
18952192 10820.9
21049344 10949.4
23146496 10990.7
25243648 11021.6
27340800 11028.8
29437952 11083.2
31535104 11098.9
33632256 10993.3
37826560 10616.5
42020864 10375.5
46215168 10186.1
50409472 10085.4
54603776 10013.9
58798080 10004.8
62992384 9998.6
67186688 10006.4
Thanks in advance.
$ pgcc -V
pgcc 17.4-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
$ cat /usr/local/cuda-8.0/version.txt
CUDA Version 8.0.61