
I want to support the following operation in C++:

```cpp
template <typename T>
void generate_random_simd(T* array, std::size_t len, T lower_bound, T upper_bound) {
    // uses SIMD instructions to fill array[0..len) with
    // random numbers in the range [lower_bound, upper_bound]
}
```

The type T could be any unsigned integer, signed integer, or floating-point type, either 32- or 64-bit. Is there an efficient implementation available directly, or some literature on this subject?

I did find a few implementations, such as this and this. But they don't support all the above types, nor do they support supplying an upper/lower bound. Using them might in turn require additional processing to achieve the result, whose overhead I'm afraid would end up being equivalent to simply looping and using a standard (non-SIMD) C++ random number generator.

tangy
  • The only paper (and implementation) I know of can be found [here](https://www.ti.uni-bielefeld.de/downloads/publications/templateSIMD.pdf). This paper is a very general approach, but I believe you could modify it for your needs. – Thomas Lang Dec 02 '18 at 18:59
  • I'm more interested in *why* you want your own PRNG? And why does it need to be "vectorized"? Isn't [the standard PRNG functionality](https://en.cppreference.com/w/cpp/numeric/random) enough? – Some programmer dude Dec 02 '18 at 19:01
  • If I want to generate a huge random array in minimal time, using simd might aid in making things faster right? – tangy Dec 02 '18 at 19:05
  • Maybe. Or maybe not. At least take a look at the standard library PRNG facilities; they are pretty good, and probably very highly optimized (like the rest of the standard library). Better to measure and profile; I doubt this is going to be a bottleneck. Actually, generating a "large" number is probably not going to be measurably slower (or faster) than a small number. – Some programmer dude Dec 02 '18 at 19:10
  • Thanks @Someprogrammerdude, I completely agree and definitely plan on microbenchmarking before usage. I was just hoping to be pointed to the right resources via this question. – tangy Dec 02 '18 at 19:14
  • @ThomasLang thanks for pointing me to that paper - I'll have a look. – tangy Dec 02 '18 at 19:14
  • I'm not sure why the downvote - if someone told me, I could edit accordingly or avoid repeating such a question. – tangy Dec 02 '18 at 19:15
  • Then no more objections from me. Experimenting is good and fun. But don't forget the measurements. And that premature optimizations are usually bad. And that any optimizations (which tend to make code hard to read and maintain) need thorough documentation (especially the reason a non-standard solution was chosen, if you go that path). :) – Some programmer dude Dec 02 '18 at 19:23
  • As for the "all types" thing, templates (possibly with specialization) are a given. Then it might even be possible to support non-numeric types, as long as they support the operations you perform (like having a valid `operator<` function defined). – Some programmer dude Dec 02 '18 at 19:26

1 Answer


Element boundaries only matter when you have lower/upper bounds. Otherwise, for integers, you just want 128 or 256 bits of random data in a SIMD vector.

For example, you can use an SSE2 / AVX2 xorshift+ which runs multiple xorshift+ generators in 64-bit SIMD elements. You can treat this as 16x uint8_t, or 2x uint64_t, or anything in between, when you want to actually use the random data for something.
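Below is a minimal sketch of that idea using AVX2 intrinsics: four independent xorshift128+ generators, one per 64-bit lane. The struct layout and the (omitted) seeding are my assumptions, not the linked implementation; each lane needs its own distinct, nonzero seed.

```cpp
#include <immintrin.h>

// Four independent xorshift128+ generators, one per 64-bit lane.
// Lane i's state is (s0[i], s1[i]); seed every lane with a distinct
// nonzero value (seeding omitted in this sketch).
struct xorshift128plus_avx2 {
    __m256i s0, s1;

    __m256i next() {  // returns 256 fresh random bits
        __m256i x = s0;
        __m256i y = s1;
        s0 = y;
        x = _mm256_xor_si256(x, _mm256_slli_epi64(x, 23));  // x ^= x << 23
        x = _mm256_xor_si256(x, _mm256_srli_epi64(x, 17));  // x ^= x >> 17
        x = _mm256_xor_si256(x, y);                         // x ^= y
        x = _mm256_xor_si256(x, _mm256_srli_epi64(y, 26));  // x ^= y >> 26
        s1 = x;
        return _mm256_add_epi64(x, y);
    }
};
```

The returned vector can be stored and reinterpreted at any element width, which is why element boundaries don't matter until you impose a range.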

Here's an example of using that as 16-bit elements to produce multiple vectors of decimal digits, in my answer on "What's the fastest way to generate a 1 GB text file containing random digits?" over on unix.SE. (Written in C with Intel intrinsics, with Core 2, Haswell, and Skylake benchmark numbers.)

It runs fast enough that you'll want to consume the output while it's still hot in cache, e.g. cache-block in chunks of 4 or 8 kiB for L1d hits. Or simply use a vector of random numbers as they're produced.
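As a concrete shape for that, a hedged sketch of the cache-blocked loop; `fill_random` and `process` are hypothetical stand-ins for the generator and the consumer:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

void fill_random(uint64_t* buf, std::size_t n);    // hypothetical SIMD generator
void process(const uint64_t* buf, std::size_t n);  // hypothetical consumer

void generate_and_consume(std::size_t total) {
    constexpr std::size_t CHUNK = 4096 / sizeof(uint64_t);  // 4 KiB stays hot in L1d
    alignas(32) uint64_t buf[CHUNK];
    for (std::size_t done = 0; done < total; done += CHUNK) {
        std::size_t n = std::min(CHUNK, total - done);
        fill_random(buf, n);  // write one chunk
        process(buf, n);      // read it back while it's still in cache
    }
}
```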

You can of course use a different divisor, and add something to each element to get a range other than 0..upper. But it's most efficient with a compile-time-constant range. Still, you could use libdivide for SIMD division (or modulo) by a runtime variable.
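If you'd rather avoid division altogether, another option (my suggestion, not part of the answer above) is Lemire's multiply-shift range reduction: take the high half of a widening multiply of the random value by the range. A hedged AVX2 sketch for 32-bit lanes, ignoring the small bias for ranges that don't divide 2^32:

```cpp
#include <immintrin.h>
#include <cstdint>

// Map each of 8 random uint32_t lanes into [lower, upper) without division:
// result = hi32(x * range) + lower. Slightly biased unless range divides 2^32.
static inline __m256i scale_to_range_u32(__m256i x, uint32_t lower, uint32_t upper) {
    const __m256i range = _mm256_set1_epi32((int)(upper - lower));
    // _mm256_mul_epu32 widens lanes 0,2,4,6; handle odd lanes via a 32-bit shift.
    __m256i even = _mm256_mul_epu32(x, range);
    __m256i odd  = _mm256_mul_epu32(_mm256_srli_epi64(x, 32), range);
    even = _mm256_srli_epi64(even, 32);  // high halves down into even 32-bit lanes
    odd  = _mm256_and_si256(odd, _mm256_set1_epi64x((long long)0xFFFFFFFF00000000ULL));
    __m256i scaled = _mm256_or_si256(even, odd);  // re-interleave the high halves
    return _mm256_add_epi32(scaled, _mm256_set1_epi32((int)lower));
}
```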

With unknown upper/lower bounds, you probably only want to use each input vector for one vector of results. When I was optimizing for max speed, it made sense to extract multiple 0..9 digits out of each 16-bit integer, to save xorshift+ work. 0..9 is such a small fraction of 0..65535 that there's plenty of entropy left, and the second digit has a different bias than the first remainder.


FP is harder than integer because some bit-patterns represent NaN. And you often want a uniform distribution along the real number line, not a uniform distribution of finite bit patterns. (Half of all representable float values have magnitude less than 1.0. The closer to zero you get, the closer together floats can be.)

Apparently it's common to generate uniform FP random numbers in a [0, 1.0) range. (That's 1/4 of the total representable values.) Scaling to a [0, N) range with a multiply works fine for N < 2^24, but for larger N you start to lose entropy and introduce bias, according to Daniel Lemire's article, "How many floating-point numbers are in the interval [0,1]?".

Depending on the size of your range, it seems to me it's a lot easier to generate them in the [1.0, 2.0) range (or any other single-exponent range), by combining a 23-bit random significand (mantissa) with a fixed exponent / sign-bit.

That's fewer bits of entropy, but the result is trivially uniform and can be produced with a SIMD _mm_and_ps and _mm_or_ps. (Unfortunately the significand is only 23 bits wide, not a multiple of 8 or 16, so we can't simply use _mm_blendv_epi8 or _mm_blend_epi16.)
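A minimal sketch of that bit trick with SSE intrinsics; the final mapping to a [lower, upper) range is my addition, not part of the recipe above:

```cpp
#include <immintrin.h>

// Keep 23 random mantissa bits, force sign = 0 and exponent = 127,
// giving a uniform float in [1.0, 2.0) in every lane.
static inline __m128 random_float_1_2(__m128i random_bits) {
    const __m128 mantissa_mask = _mm_castsi128_ps(_mm_set1_epi32(0x007FFFFF));
    const __m128 one = _mm_set1_ps(1.0f);  // sign 0, exponent 127, mantissa 0
    __m128 m = _mm_and_ps(_mm_castsi128_ps(random_bits), mantissa_mask);
    return _mm_or_ps(m, one);
}

// Map to [lower, upper): subtract 1.0 to get [0, 1), then scale and shift.
// Subject to the entropy/bias caveats above when (upper - lower) is large.
static inline __m128 random_float_range(__m128i random_bits, float lower, float upper) {
    __m128 u01 = _mm_sub_ps(random_float_1_2(random_bits), _mm_set1_ps(1.0f));
    return _mm_add_ps(_mm_mul_ps(u01, _mm_set1_ps(upper - lower)),
                      _mm_set1_ps(lower));
}
```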


If you want a [distribution other than uniform](https://en.wikipedia.org/wiki/Random_number_generation#Generation_from_a_probability_distribution), e.g. Gaussian or Poisson, you'll have to find an algorithm for that.


Sampling with rejection doesn't work well for SIMD because of the required branching. You could maybe generate two candidate vectors of random numbers and branchlessly combine them, then branch only if any lanes still need to be rejected.
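A hedged sketch of that two-candidate combine, assuming nonnegative signed 32-bit lanes and an exclusive upper limit (the "in range" convention here is my assumption; adapt it for unsigned values or two-sided bounds):

```cpp
#include <immintrin.h>

// Where candidate `a` is out of range, substitute candidate `b`; report
// whether any lane is still out of range so the caller can (rarely) retry.
static inline __m256i combine_candidates(__m256i a, __m256i b, __m256i limit,
                                         bool* needs_retry) {
    __m256i a_ok   = _mm256_cmpgt_epi32(limit, a);       // lanes where a < limit
    __m256i merged = _mm256_blendv_epi8(b, a, a_ok);     // take a if ok, else b
    __m256i m_ok   = _mm256_cmpgt_epi32(limit, merged);  // recheck merged lanes
    *needs_retry   = _mm256_movemask_epi8(m_ok) != -1;   // any lane still bad?
    return merged;
}
```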

Maybe left-packing the non-rejected candidates would let you fairly efficiently fill a buffer with random numbers, producing a variable number each iteration. See "AVX2 what is the most efficient way to pack left based on a mask?" for SSE2 / AVX2 / AVX-512 left-packing.
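With AVX-512, the left-pack step is a single compress-store. A short sketch (the function name and buffer handling are mine):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Append only the accepted lanes of `candidates` to dst, contiguously.
// Returns the count written so the caller can advance dst. Requires AVX-512F.
static inline std::size_t append_accepted(uint32_t* dst, __m512i candidates,
                                          __mmask16 accept) {
    _mm512_mask_compressstoreu_epi32(dst, accept, candidates);
    return (std::size_t)_mm_popcnt_u32((unsigned)accept);
}
```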

Again, keep the buffer chunk size small enough that you get L1d or at least L2 cache hits when you loop back over it.

Severin Pappadeux
Peter Cordes
  • Thanks for pointing to libdivide for the modulo - should be handy. I had 2 follow-ups if you don't mind: (1) For floats, is there something like a SIMD multiplication library? What I thought was that I could use a standard implementation of something like rand(0, 1) and scale it to [LB, UB) using a simple approach. (2) This entire operation seems to be I/O bound. Do you expect a potential benefit from introducing SIMD for the random float/int with runtime upper/lower bound cases vs. non-SIMD? – tangy Dec 02 '18 at 21:23
  • @tangy: you don't need a library for FP multiplication, CPUs already have it built in. On x86 you can use `_mm_mul_ps`. See https://stackoverflow.com/tags/sse/info. It's not like `rand(0,1)` is any easier to implement, though. (2) It's only I/O bound if you do any I/O. If you use the SIMD vectors of random numbers as they're generated, they'll already be in a register. Or generating a small block of ~4 KiB will still be hot in L1d cache when you come back and read it, but that still costs you the store/reload and can't overlap generation with the computation that uses them. – Peter Cordes Dec 02 '18 at 21:38
  • `rand(0, 1)` is available via one of the library implementations linked to above fortunately. Yes, I was asking about the case where I am not overlapping generation with computation and the size of the array might be > 4K. – tangy Dec 02 '18 at 22:08
  • @tangy: then don't do that. Cache-block your loop to generate 4 or 8 KiB of random data, then process it, then generate more, etc. until you're done. You don't want to bottleneck on memory bandwidth; L1d bandwidth is preferable, and even L2 bandwidth is *much* better (and private per core, so you aren't competing with other cores). – Peter Cordes Dec 02 '18 at 22:32
  • @tangy: updated with more info about random FP numbers. Multiplying a `[0,1)` random number is apparently fine up to a scale factor of about `2^24`, but you might be able to make the process more efficient by generating random numbers in a custom range in the first place, at least with a power-of-2 upper bound. – Peter Cordes Dec 03 '18 at 01:00
  • Thanks for the detailed answer Peter! – tangy Dec 03 '18 at 21:28