
I am looking for ways to efficiently utilise my GPU cluster to calculate percentiles of randomly generated arrays for my Monte Carlo simulation. I would assume that the GPU would be faster than the same calculation on the CPU. When I compare CuPy and NumPy in a single-threaded process there is, as expected, a significant performance improvement:

import cupy as cp

sample_size = 9000000
# generate normally distributed samples (mean 10, std 0.1) directly on the GPU
cp_res = cp.random.normal(10, 0.1, size=(400 * sample_size,), dtype=cp.float32)
print(cp.percentile(cp_res, 0.05))

This runs in about 226 ms.
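In case it helps reproduce that number, here is a minimal timing sketch. CuPy launches kernels asynchronously, so I believe the device has to be synchronized before stopping the clock; the snippet is only illustrative, not necessarily how the 226 ms was measured.

import time
import cupy as cp

sample_size = 9000000

start = time.perf_counter()
cp_res = cp.random.normal(10, 0.1, size=(400 * sample_size,), dtype=cp.float32)
pct = cp.percentile(cp_res, 0.05)
cp.cuda.Device().synchronize()  # make sure the GPU work has actually finished
print(float(pct), "in", time.perf_counter() - start, "seconds")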

What would be the most efficient way to run the percentile calculation on 4 * 400 * sample_size random numbers across 2 servers with 2 GPUs each?

I am running:

import cupy as cp
import dask.array as da
from dask.distributed import Client

client = Client(cluster_ip_address)  # address of the Dask scheduler
rs = da.random.RandomState(RandomState=cp.random.RandomState)  # CuPy-backed chunks
x = rs.normal(10, 0.1, size=(4 * 400 * sample_size,), chunks='auto')
print(da.percentile(x, 0.05).compute())

I am getting this error: TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

When I swap CuPy for NumPy the code works fine. Am I using it wrongly? Is there an alternative way to use the GPUs to generate a large array of normally distributed numbers and calculate its percentile?
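For a sense of scale, a quick back-of-the-envelope on the array above (my own arithmetic):

sample_size = 9000000
n_elements = 4 * 400 * sample_size             # 14,400,000,000 samples in total
print(n_elements * 4 / 1e9, "GB as float32")   # ~57.6 GB
print(n_elements * 8 / 1e9, "GB as float64")   # ~115.2 GB (what rs.normal produces when no dtype is given)

So the data has to be split across the GPUs one way or another, which is why I am looking at Dask.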

Genadyc
  • Are you trying to run a larger number of simulations on data that fits in single GPU memory or one simulation on data too large for a single GPU? Your Dask example is still running one percentile calculation. – Nick Becker Dec 08 '20 at 14:27
  • I am trying to run a large number of simulations. Basically generate 10000 arrays of 9M elements and calculate the percentile. I can fit only 140 into a single GPU, so I was thinking I may be able to distribute the calculations across 4 cards. – Genadyc Dec 09 '20 at 00:52
  • I'll answer this question with a small example tomorrow using Dask. What you're describing will work well. – Nick Becker Dec 09 '20 at 03:38
  • Actually, hold on. Are you trying to create one vector of 10000 * 9M elements and get the percentile of that, or the 10000 percentiles each on a vector of 9M elements? – Nick Becker Dec 09 '20 at 14:33
  • Apologies for the confusion. I want to generate 9M*10000 normally distributed numbers and calculate the percentile of that. – Genadyc Dec 10 '20 at 03:54
  • Great. Your existing code will work if you use "lower" as the interpolation method and use a LocalCUDACluster, though it may or may not be faster than the equivalent Dask CPU cluster. I will show a brief example. – Nick Becker Dec 10 '20 at 14:45
  • This answer should provide guidance on how to set up a Dask CUDA cluster. You'll have to benchmark whether this calculation is faster with CuPy. It's possible the memory constraints will result in it being faster with a Dask CPU cluster. https://stackoverflow.com/questions/58114113/cudf-error-processing-a-large-number-of-parquet-files/58123478#58123478 – Nick Becker Dec 10 '20 at 15:17
  • Still doesn't really work. I've updated my initial question with more details – Genadyc Dec 11 '20 at 07:12
  • I tried my distributed cluster and LocalCUDACluster with the same results – Genadyc Dec 11 '20 at 07:25
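
For completeness, this is roughly what I tried with LocalCUDACluster, following Nick's suggestion above. It is only a sketch of my understanding: dask_cuda.LocalCUDACluster and interpolation="lower" come straight from the comments, the rest is my guess at the wiring.

import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # one worker per GPU visible on this node
    client = Client(cluster)

    sample_size = 9000000
    rs = da.random.RandomState(RandomState=cp.random.RandomState)
    x = rs.normal(10, 0.1, size=(4 * 400 * sample_size,), chunks='auto')

    # "lower" returns an existing sample instead of interpolating between two,
    # which is the interpolation method suggested in the comments
    print(da.percentile(x, 0.05, interpolation="lower").compute())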

0 Answers