Halton sequence is generated by:
- get the representation of i in base-p numeral system
- reverse the bit order
For example, the base-2 Halton sequence:
index binary reversed result
1 1 1 1 / 10 = 1 / 2
2 10 01 01 / 100 = 1 / 4
3 11 11 11 / 100 = 3 / 4
4 100 001 001 / 1000 = 1 / 8
5 101 101 101 / 1000 = 5 / 8
6 110 011 011 / 1000 = 3 / 8
7 111 111 111 / 1000 = 7 / 8
So there is indeed a lot of repeated work in the bit-wise reverse. The first thing we can do is to reuse previous results.
When computing the element with index i in base-p Halton sequence, we first determine the leading bit and the remaining part of the base-p representation of i (this can be done by scheduling threads in a base-p fashion). Then we have
out[i] = out[remaining_part] + leading_bit / p^(length_of_i_in_base_p_representation - 1)
//"^" is used for convenience
To avoid unnecessary global memory read, each thread should handle all elements with the same "remaining part" but different "leading bit".
If we are generating Halton sequence between p^n and p^(n+1), there should conceptually be p^n parallel tasks. However, it causes no problem if we assign a thread with a group of tasks.
Further optimization can be made by mixing re-compute and load-from-memory.
Example code:
Total thread number should be p^m.
const int m = 3 //any value
__device__ void halton(float* out, int p, int N)
{
const int tid = ... //globally unique and continuous thread id
const int step = p^m; //you know what I mean
int w = step; //w is the weight of the leading bit
for(int n = m; n <= N; ++n) //n is the position of the leading bit
{
for(int r = tid; r < w; r += step) //r is the remaining part
for(int l = 1; l < p; ++l) //l is the leading bit
out[l*w + r] = out[r] + l/w;
w *= p;
}
}
note: this example does not compute the first p^m elements in Halton sequence, but these values are still needed.