I'm playing around with some hardware entropy generation. I'm sampling the environment with electronic sensors, using A2D converters with 12-16 bits of resolution. I'm hypothesizing that the lower-order bits of these readings contain enough noise and real-world entropy to be effectively "random" - at least enough to seed a PRNG.
My approach is to take the samples and progressively truncate them, chopping off the most significant bits, until the results start looking uniformly distributed. And then over time I can pack these lower-order "random" bits together to make larger "random" integer values.
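To make the truncate-and-pack idea concrete, here's a minimal sketch of what I mean (read_adc() is just a placeholder for however a raw integer sample gets read - it's not a real API):

def collect_random_bits(read_adc, keep_bits, out_bits=32):
    """Mask off the low `keep_bits` bits of each sensor reading and pack
    them together until `out_bits` bits have been accumulated."""
    value = 0
    filled = 0
    while filled < out_bits:
        sample = read_adc()                     # one raw 12-16 bit integer reading
        low = sample & ((1 << keep_bits) - 1)   # keep only the low-order bits
        value = (value << keep_bits) | low      # shift them into the accumulator
        filled += keep_bits
    return value & ((1 << out_bits) - 1)        # trim any overshoot on the last step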
Setting aside the validity of this approach (please), my real problem is evaluating the "randomness" of the data - more precisely, measuring the uniformity of the distribution of the truncated samples. Again, I'm shaving off higher-order bits progressively until the resulting values look more-or-less uniformly distributed... but I'm having trouble coming up with a way to statistically measure that uniformity, even roughly.
I first tried the Chi-Square approach, but apparently this doesn't handle large data sets well? (I've got 10k samples per sensor.)
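For reference, this is roughly what I mean by the Chi-Square approach - a sketch, assuming scipy/numpy are available and samples is a list of truncated integer readings:

import numpy as np
from scipy.stats import chisquare

def chi2_uniformity_p(samples, bits):
    """Chi-squared goodness-of-fit p-value for integer samples against a
    discrete uniform distribution over [0, 2**bits)."""
    counts = np.bincount(np.asarray(samples), minlength=2 ** bits)  # observed count per bin
    stat, p = chisquare(counts)  # expected counts default to equal (uniform) frequencies
    return p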
Then I moved on to the Kolmogorov-Smirnov test, trying to compare my data set to the uniform distribution. The kstest() function expects floating point values between 0 and 1. I can normalize my integer samples by dividing each value by the max value (2^bits) to fit this requirement, but when the bit width gets low, the values are still going to be very discrete (for example, with a truncated width of 2 bits, the only possible values are 0.0, 0.25, 0.5 and 0.75).
I decided to test this with the following code:
import random
from scipy.stats import kstest

if __name__ == '__main__':
    samples = 1000
    random.seed(0)
    print('samples: {}'.format(samples))
    for bits in range(1, 16):
        max_val = (2 ** bits) - 1
        obs = []
        for i in range(samples):
            # Simulate a truncated reading: a random integer of `bits` bits,
            # normalized into [0.0, 1.0] for kstest().
            int_val = random.randint(0, max_val)
            obs.append(float(int_val) / max_val)
        # Compare the normalized samples against the continuous uniform(0, 1) CDF.
        stat, p = kstest(obs, 'uniform')
        print('{:3} bits : p={}'.format(bits, p))
I'm not sure how to interpret the results. It seems to show that the algorithm doesn't work well for low bit counts, but the results still vary a lot as you add bits. If you comment out the random.seed(0) line, you'll see the p-values vary wildly.
Here are the values for seed zero:
samples: 1000
1 bits : p=1.1061100101127136e-234
2 bits : p=5.3150572686544944e-62
3 bits : p=4.4512615958880796e-14
4 bits : p=0.0006808631612435064
5 bits : p=0.14140010855519292
6 bits : p=0.20814194572050182
7 bits : p=0.49167945198677787
8 bits : p=0.496368700963041
9 bits : p=0.8532040305202887
10 bits : p=0.6961716307178489
11 bits : p=0.41693384153785173
12 bits : p=0.8484539510181118
13 bits : p=0.16853102951415444
14 bits : p=0.8175442993699236
15 bits : p=0.30957048529568965
My questions:
- Is there a better way to evaluate the uniformity of a distribution of integer values (that is, values ranging from zero up to some power of 2), particularly in the case of fewer bits per sample (1-4)? I'm looking for some Python code that will digest a bunch of samples and give me a quantitative, meaningful evaluation of how uniform the distribution appears to be.
- Am I applying the K-S test properly here? I would think that as the number of bits per sample goes up, the resulting normalized values would appear more continuous (less discrete), which I further assume would make the p-value "get better". But after 5-6 bits it doesn't seem to matter - the p-values wander all over the place. So I'm wondering if I'm doing this right, or maybe misunderstanding what I'm measuring.