
I'm playing around with some hardware entropy generation. I'm using electronic sensors to sample the environment, via an A2D converter with between 12 and 16 bits of resolution. I'm hypothesizing that the lower-order bits of these readings contain enough noise and real-world entropy to be effectively "random" - at least enough to seed a PRNG.

My approach is to take the samples and progressively truncate them, chopping off the most significant bits, until the results start looking uniformly distributed. And then over time I can pack these lower-order "random" bits together to make larger "random" integer values.
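For concreteness, here's roughly what I mean (just a sketch - low_bits and pack_bits are made-up helper names, and real sensor readings would replace the input list):

def low_bits(sample, keep_bits):
    """Keep only the lowest keep_bits bits of an integer sample."""
    return sample & ((1 << keep_bits) - 1)

def pack_bits(samples, keep_bits, out_bits=32):
    """Pack the low bits of successive samples into out_bits-wide integers."""
    packed, acc, filled = [], 0, 0
    for s in samples:
        acc = (acc << keep_bits) | low_bits(s, keep_bits)
        filled += keep_bits
        if filled >= out_bits:
            packed.append(acc)
            acc, filled = 0, 0
    return packed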

Setting aside the validity of this approach (please), my real problem is evaluating the "randomness" of the data. More precisely, measuring the uniformity of the distribution of the truncated samples. Again, I'm shaving off higher-order bits progressively until the resulting values look more-or-less uniformly distributed... but I'm having trouble coming up with a way to statistically measure that uniformity, even roughly.

I first tried the Chi-Square approach, but apparently this doesn't handle large data sets well? (I've got 10k samples per sensor.)
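(For reference, this is roughly the Chi-Square usage I mean - scipy.stats.chisquare on the binned counts, which defaults to equal expected frequencies, i.e. a uniform fit. This is a sketch, not my exact code:)

import numpy as np
from scipy.stats import chisquare

def chi2_uniform_p(int_samples, bits):
    # count how often each of the 2**bits possible values occurs
    counts = np.bincount(int_samples, minlength=2 ** bits)
    # default expected frequencies are equal, i.e. a uniform distribution
    stat, p = chisquare(counts)
    return p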

Then I moved on to the Kolmogorov-Smirnov test, trying to compare my data set to the uniform distribution. The kstest() function expects floating point values between 0 and 1. I can normalize my integer samples by dividing each value by the maximum value (2^bits - 1) to fit this requirement, but when the bit width gets low, the values are still going to be very discrete (for example, with a truncated width of 2 bits, the only possible values are 0.0, 0.333..., 0.666... and 1.0).

I decided to test this with the following code:

import random
from scipy.stats import kstest

if __name__ == '__main__':
    samples = 1000
    random.seed(0)
    print('samples: {}'.format(samples))

    for bits in range(1, 16):
        max_val = (2 ** bits) - 1
        obs = []
        for i in range(samples):
            int_val = random.randint(0, max_val)
            # normalize to [0.0, 1.0] so it can be compared against 'uniform'
            obs.append(float(int_val) / max_val)
        stat, p = kstest(obs, 'uniform')
        print('{:3} bits : p={}'.format(bits, p))

I'm not sure how to interpret the results. It seems to show that the algorithm doesn't work well for low bit counts, but the results still vary a lot as you add bits. If you comment out the seed(0) line, you'll see p values vary wildly.

Here are the values for seed zero:

samples: 1000
  1 bits : p=1.1061100101127136e-234
  2 bits : p=5.3150572686544944e-62
  3 bits : p=4.4512615958880796e-14
  4 bits : p=0.0006808631612435064
  5 bits : p=0.14140010855519292
  6 bits : p=0.20814194572050182
  7 bits : p=0.49167945198677787
  8 bits : p=0.496368700963041
  9 bits : p=0.8532040305202887
 10 bits : p=0.6961716307178489
 11 bits : p=0.41693384153785173
 12 bits : p=0.8484539510181118
 13 bits : p=0.16853102951415444
 14 bits : p=0.8175442993699236
 15 bits : p=0.30957048529568965

My questions:

  1. Is there a better way to evaluate the uniform-distribution-ness of a range of integer values (that is, a range from zero to some power of 2), particularly in the case of fewer bits per sample (1-4)? I'm looking for some python code that will digest a bunch of samples and give me some quantitative and meaningful evaluation of how uniform the distribution appears to be.
  2. Am I applying the K-S test properly here? I would think that as the number of bits per sample goes up, the resulting normalized values would appear more continuous (less discrete), which I further assume would make the p-value "get better". But after 5-6 bits, it doesn't seem to matter - it seems to wander all over the place. So I'm wondering if I'm doing it right or maybe misunderstanding what I'm measuring.
cbp2
  • Okay... I think the key here was that I was feeding `kstest()` a [PDF](https://en.wikipedia.org/wiki/Probability_density_function) (in the form of a "normalized histogram"), not a [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function). – cbp2 Jul 25 '22 at 15:35

2 Answers


The K-S test statistic (see the Wikipedia article) is the maximum distance between your sampled (aka empirical) CDF and the ideal (aka model) CDF, where CDF is the cumulative distribution function.

For the uniform distribution on [0, 1], the CDF is very simple:

import numpy as np

def uniCDF(x):
    # CDF of the uniform distribution on [0, 1]
    # (kstest calls this with an array of samples, so keep it vectorized)
    return np.clip(x, 0.0, 1.0)

So make a huge array of your samples (however you got them) and call kstest with something like

kstest(samples, uniCDF)

and see what you get.
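Putting that together with normalized integer samples (a sketch - the uniform random integers here just stand in for real, truncated sensor readings):

import random
import numpy as np
from scipy.stats import kstest

def uniCDF(x):
    # same uniform CDF as above, vectorized so kstest can call it on an array
    return np.clip(x, 0.0, 1.0)

bits = 4
max_val = (2 ** bits) - 1
random.seed(0)
samples = [random.randint(0, max_val) / max_val for _ in range(10000)]

stat, p = kstest(samples, uniCDF)
print('stat={:.4f} p={:.4f}'.format(stat, p))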

Severin Pappadeux

The other answer posted wasn't really the solution, but it did highlight what I was doing wrong. I wasn't using kstest() correctly. It wants a CDF, not a PDF - and I was effectively giving it a PDF.

Here's the updated test code:

from scipy.stats import kstest
import numpy as np

if __name__ == '__main__':
    samples = 1000
    np.random.seed(0)  # seed numpy's generator, since np.random is used below
    print('samples: {}'.format(samples))

    for bits in range(1, 16):
        max_val = 2 ** bits
        obs = []
        for i in range(samples):
            obs.append(np.random.randint(max_val))

        # build an empirical CDF from a histogram of the integer samples
        count, bin_edges = np.histogram(obs, bins='auto')
        pdf = count / sum(count)
        cdf = np.cumsum(pdf)
        # compare the empirical CDF values against the uniform distribution
        stat, p = kstest(cdf, 'uniform')
        print('{:3} bits : p={}'.format(bits, p))

This gave the following results:

samples: 1000
  1 bits : p=0.002865197735232071
  2 bits : p=0.3578313345590808
  3 bits : p=0.9379905983222332
  4 bits : p=0.9492173736681953
  5 bits : p=0.9990666389407863
  6 bits : p=0.9991168380270133
  7 bits : p=0.9884493351721916
  8 bits : p=0.9979292700419209
  9 bits : p=0.9984858969520626
 10 bits : p=0.9927394764400753
 11 bits : p=0.9991880763266139
 12 bits : p=0.9998600940511319
 13 bits : p=0.9971921916005222
 14 bits : p=0.9996272719319321
 15 bits : p=0.999437259974763

That at least makes more sense to me.

So it looks like the K-S test might work on data down to 3 bits or so. That will probably have to do.

cbp2