I need to generate a binary file containing only unique random numbers in single precision. The purpose is to calculate the entropy of this file and compare it with the entropy of other datasets, as the ratio entropy_file/entropy_randUnique. This value is named "randomness".

I can do this in Python with double-precision numbers by packing them with struct.pack and inserting them into a set(), like so:

    numbers = set()
    while len(numbers) < size:
        numbers.add(struct.pack(precision,random.random()))
    for num in numbers:
        file.write(num)

But when I change to single precision I can't just change the pack format: that produces a lot of identical numbers and the while loop never ends. I also can't generate single-precision numbers with random. I've looked into numpy, but from what I understood its generator works the same way. How can I get 370914252 (my biggest test case) unique float32 values into a binary file? They don't even have to be random; I think a shuffled sequence would suffice.

SamGamgee

1 Answer

Your best bet is to generate random 32-bit integers, then convert them to floating point. You'll need to reject the bit representations of infinity and NaN as you generate the numbers.

You can generate your set from the integer values rather than the floating point ones, then do the conversion on output. Rather than using a set, you could use a bit map to detect which integer values have already been used; that's more likely to fit in memory, especially given the largest sample size you indicate.

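    import math
    import random
    import struct

    def random_unique_floats(n):
        # One bit per possible 32-bit pattern: 2**32 // 8 bytes, all zero.
        # xrange is Python 2; use range on Python 3.
        used = bytearray(0 for i in xrange(2**32 // 8))
        count = 0
        while count < n:
            bits = random.getrandbits(32)
            # Reinterpret the integer's bits as a single-precision float.
            value = struct.unpack('f', struct.pack('I', bits))[0]
            # Skip infinities and NaNs.
            if not math.isinf(value) and not math.isnan(value):
                index = bits // 8
                mask = 0x01 << (bits & 0x07)
                if used[index] & mask == 0:
                    yield value
                    used[index] |= mask
                    count += 1

    # file must be opened in binary mode ('wb').
    for num in random_unique_floats(size):
        file.write(struct.pack('f', num))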
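To put rough numbers on the bit map approach, here is a small sketch of the arithmetic. The names `bitmap_bytes` and `expected_draws` are made up for illustration, and the draw count is a logarithmic approximation of the usual rejection-sampling sum, assuming `getrandbits(32)` is uniform over all 2**32 patterns.

```python
import math

TOTAL_PATTERNS = 2**32
# Exponent bits all ones (either sign) encode the infinities and NaNs.
NON_FINITE = 2 * 2**23
FINITE = TOTAL_PATTERNS - NON_FINITE


def bitmap_bytes():
    """Size of a one-bit-per-pattern map covering every 32-bit value."""
    return TOTAL_PATTERNS // 8


def expected_draws(n):
    """Approximate number of 32-bit draws needed for n unique finite floats.

    Approximates sum(TOTAL_PATTERNS / (FINITE - k) for k in range(n))
    by TOTAL_PATTERNS * ln(FINITE / (FINITE - n)).
    """
    if not 0 <= n < FINITE:
        raise ValueError("n must be in [0, %d)" % FINITE)
    return TOTAL_PATTERNS * math.log(float(FINITE) / (FINITE - n))


if __name__ == "__main__":
    n = 370914252  # largest test case from the question
    print("bit map size (MiB):", bitmap_bytes() // 2**20)
    print("finite float32 patterns:", FINITE)
    print("draws per unique value:", expected_draws(n) / n)
```

The bit map comes to 512 MiB regardless of the sample size, and for the largest test case in the question the estimate works out to only a few percent more draws than values written.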

Note that as the number of samples approaches the number of possible floating-point values, the run time rises steeply, because more and more of the random draws hit values that have already been used.
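If a shuffled-looking sequence is enough, as the question suggests, one alternative that needs no bit map and never rejects duplicates is to push a plain counter through an invertible 32-bit mixing function. This is a different technique from the one above, not a refinement of it. The sketch below uses the well-known MurmurHash3 32-bit finalizer steps as the mixer (each step is a bijection on 32-bit integers, so distinct counters give distinct bit patterns). The names `mix32`, `unique_float32_bits`, `write_unique_float32` and the `start` parameter are hypothetical, and writing little-endian bytes is an assumption. The output is deterministic, so treat it as a scrambled enumeration rather than real randomness.

```python
import struct

MASK32 = 0xFFFFFFFF
EXP_MASK = 0x7F800000          # float32 exponent field
FINITE_COUNT = 2**32 - 2**24   # bit patterns that are not inf or NaN


def mix32(x):
    """Invertible mixer: xor-shifts and odd multiplications mod 2**32."""
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & MASK32
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & MASK32
    x ^= x >> 16
    return x


def unique_float32_bits(n, start=0):
    """Yield n distinct 32-bit patterns that decode to finite float32 values."""
    if n > FINITE_COUNT:
        raise ValueError("only %d finite float32 patterns exist" % FINITE_COUNT)
    produced = 0
    counter = start
    while produced < n:
        bits = mix32(counter & MASK32)
        counter += 1
        # Exponent bits all ones means infinity or NaN: skip it.
        if (bits & EXP_MASK) == EXP_MASK:
            continue
        yield bits
        produced += 1


def write_unique_float32(out, n, start=0):
    """Write n unique float32 values to a binary file object, little-endian."""
    for bits in unique_float32_bits(n, start):
        # The 4 bytes of the integer are exactly the float32 encoding.
        out.write(struct.pack('<I', bits))
```

Usage would look like `with open('unique.bin', 'wb') as f: write_unique_float32(f, size)`. Note that 0.0 and -0.0 are distinct bit patterns here (as in the bit map version), and that one `write` call per value will be slow in pure Python for hundreds of millions of values, so batching the packed bytes is probably worthwhile.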

Mark Ransom
  • I've never used `yield`, I am trying this `for num in random_unique_floats(size): file.write(num)` which I'm not sure it is correct. It is giving an error "used = bytearray(0 for i in range(2**32 / 8)) TypeError: 'float' object cannot be interpreted as an integer" edit: I've changed xrange to range because I'm using python3 – SamGamgee Nov 20 '13 at 21:35
  • @SamGamgee, then you need to use `2**32 // 8` to get integer division. I'll edit the answer. I'm leaving the `xrange` as is though so the answer still works for Python 2. – Mark Ransom Nov 20 '13 at 22:35
  • Ok that makes sense (also thank you for teaching me the // division :)) I am running the program now with increasing samples, so far it appears to be working correctly! Will mark as the answer soon. – SamGamgee Nov 21 '13 at 12:11