
I want to write a kernel in CUDA that would generate the Halton sequence in parallel, with 1 value generated and stored by each thread.

Looking at the sequence, it seems that generating each value builds on work done while generating the previous one. Generating each value from scratch would therefore involve redundant work and create a large gap between the execution times of threads.

Is there any way to do this with a parallel kernel that improves upon the serial algorithm? I'm really new to parallel programming so pardon the question if the answer is some well-known pattern.

Note: I did find this link in a textbook (which uses it without describing how it works) but the file link there is dead.

Aakash Jain
    You might want to look at the Sobol quasi-random generator in CURAND as an alternative. If the recurrence for the Halton sequence is expressible as a matrix-vector multiplication between a transformation matrix M and a state vector, you can parallelize it by having thread i compute M^i in O(log(n)) time, then perform the matrix-vector multiplication to generate the i-th state vector. You could also precompute powers of M to cut down on redundant computation. – njuffa Apr 10 '15 at 08:32

1 Answer


The Halton sequence is generated by:

  1. writing the index i in the base-p numeral system
  2. reversing the digit order (mirroring the digits about the radix point)

For example, the base-2 Halton sequence:

index      binary     reversed     result
1             1           1           1 /   10 = 1 / 2
2            10          01          01 /  100 = 1 / 4
3            11          11          11 /  100 = 3 / 4
4           100         001         001 / 1000 = 1 / 8
5           101         101         101 / 1000 = 5 / 8
6           110         011         011 / 1000 = 3 / 8
7           111         111         111 / 1000 = 7 / 8
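
For reference, the per-index "from scratch" computation described above can be sketched as a small helper (plain C++ here so it runs on the host; the same body would work inside a __device__ function — the function name is mine, not from the question):

```cpp
#include <cassert>

// Compute the i-th value of the base-p Halton sequence from scratch:
// peel off the base-p digits of i from least to most significant and
// accumulate them after the radix point, which reverses the digit order.
float halton_scratch(int i, int p)
{
    float f = 1.0f, result = 0.0f;
    while (i > 0) {
        f /= p;                 // shift one more place past the radix point
        result += f * (i % p);  // append the next digit of i, reversed
        i /= p;
    }
    return result;
}
```

For example, halton_scratch(6, 2) returns 3/8 = 0.375, matching the table. The loop runs once per digit of i, which is the per-element cost the question wants to avoid duplicating across threads.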

So there is indeed a lot of repeated work in the digit reversal. The first thing we can do is to reuse previous results.

When computing the element with index i of the base-p Halton sequence, we first split the base-p representation of i into its leading (most significant) digit and the remaining lower digits (this can be done by scheduling threads in a base-p fashion). Then we have

out[i] = out[remaining_part] + leading_digit / p^(num_digits_of_i_in_base_p)
// "^" denotes exponentiation here, not C's XOR

For example, in base 2 the index 6 = 110 splits into leading digit 1 and remainder 10 = 2, so out[6] = out[2] + 1/2^3 = 1/4 + 1/8 = 3/8, matching the table above.

To avoid unnecessary global memory reads, each thread should handle all elements that share the same "remaining part" but differ in the "leading digit". If we are generating the part of the Halton sequence between p^n and p^(n+1), there are conceptually p^n parallel tasks; however, it causes no problem to assign a group of tasks to a single thread.

Further optimization can be made by mixing recomputation with loads from memory.

Example code:

The total number of threads should be p^m.

const int m = 3; // any value
__global__ void halton(float* out, int p, int N)
{
    // globally unique, contiguous thread id
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // step = p^m, computed as an integer power ("p^m" in C would be XOR)
    int step = 1;
    for (int k = 0; k < m; ++k)
        step *= p;

    int w = step;                    // w = p^n, the index weight of the leading digit
    for (int n = m; n <= N; ++n)     // n is the position of the leading digit
    {
        for (int r = tid; r < w; r += step)  // r is the remaining part
            for (int l = 1; l < p; ++l)      // l is the leading digit
                out[l * w + r] = out[r] + (float)l / (w * p);
        w *= p;
    }
    // No synchronization is needed: out[r] was written by the same thread
    // in an earlier iteration (or precomputed), since step divides w.
}

Note: this example does not compute the first p^m elements of the Halton sequence, but the kernel reads them, so they must be precomputed (e.g. on the host) and stored in out[0 .. p^m - 1] before launch.
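
To make the recurrence concrete, here is a host-side C++ model of the kernel (the function name and the sequential loop over tid are illustrative — each tid iteration corresponds to one CUDA thread). It precomputes the first p^m elements from scratch and then applies the same update the kernel performs:

```cpp
#include <vector>
#include <cassert>

// Sequential model of the kernel: fills out[0 .. p^(N+1) - 1] with base-p
// Halton values. The first p^m elements are precomputed from scratch,
// exactly as the kernel assumes; the rest use the recurrence.
std::vector<float> halton_recurrence(int p, int m, int N)
{
    int step = 1;                        // step = p^m, the thread count
    for (int k = 0; k < m; ++k)
        step *= p;

    int total = step;                    // total = p^(N+1)
    for (int n = m; n <= N; ++n)
        total *= p;

    std::vector<float> out(total, 0.0f);

    // Precompute out[0 .. p^m - 1] by direct digit reversal.
    for (int i = 1; i < step; ++i) {
        float f = 1.0f;
        for (int v = i; v > 0; v /= p) {
            f /= p;
            out[i] += f * (v % p);
        }
    }

    // The kernel's recurrence; threads are independent (each reads only
    // values it wrote itself), so running tids one after another is valid.
    for (int tid = 0; tid < step; ++tid) {
        int w = step;
        for (int n = m; n <= N; ++n) {
            for (int r = tid; r < w; r += step)
                for (int l = 1; l < p; ++l)
                    out[l * w + r] = out[r] + (float)l / (w * p);
            w *= p;
        }
    }
    return out;
}
```

With p = 2, m = 1, N = 2 this reproduces the base-2 table above (out[6] = 3/8, out[7] = 7/8, and so on).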

Huazuo Gao