
I'm trying to sample from a discrete distribution using `std::discrete_distribution`. Here is an MWE:

// discrete_distribution
#include <iostream>
#include <random>
#include <vector>

int main()
{
  const int nrolls = 10000; // number of experiments
  const int nstars = 100;   // maximum number of stars to distribute
  std::vector<double> weights;
  weights = {1.28503e-22, 1.67881e-17, 8.99861e-13, 1.70418e-08, 9.27031e-05,
    0.106935, 16.1967, 140.325, 16.1967, 0.106935, 9.27031e-05, 1.70418e-08,
    8.99861e-13, 1.67881e-17, 1.28503e-22};

  std::default_random_engine generator;
  std::discrete_distribution<int> distribution(weights.begin(), weights.end());

  for (double x:distribution.probabilities()) std::cout << x << " ";
  std::cout << std::endl;

  int p[15]={};

  for (int i=0; i<nrolls; ++i) {
    int number = distribution(generator);
    ++p[number];
  }

  std::cout << "a discrete_distribution:" << std::endl;
  for (int i=0; i<15; ++i)
    std::cout << i << ": " << std::string(p[i]*nstars/nrolls,'*') << std::endl;

  return 0;
}

this gives:

7.43082e-25 9.70789e-20 5.20354e-15 9.8546e-11 5.36065e-07 
0.000618363 0.0936591 0.811444 0.0936591 0.000618363 5.36065e-07
9.85459e-11 5.10703e-15 0 0

a discrete_distribution:
0: 
1: 
2: 
3: 
4: 
5: 
6: *********
7: ********************************************************************************
8: *********
9: 
10: 
11: 
12: 
13: 
14:

Note the asymmetry, especially the zeros at the end. I can't see what I've done wrong. Is there something wrong with the code, or is some rounding taking place that I can't see. Thanks.

kηives
  • 6, 7, and 8 are `16.1967, 140.325, 16.1967,` which vastly outweighs the other indexes. – NathanOliver Jul 02 '18 at 16:43
  • @NathanOliver Obviously. But simple division should still hold for symmetric values. – kηives Jul 02 '18 at 16:46
  • They are symmetric: 6 and 8 are equal and 7 is off the charts, just as your weights suggest it should be. The chance of 5 or 9 even getting a result is 0.06%, which is incredibly small when you are only doing 10000 iterations – NathanOliver Jul 02 '18 at 16:46
  • @NathanOliver His issue is that the last internal weights are zero and not symmetric like the input weights. Ignore the random roll result. – Max Langhof Jul 02 '18 at 16:47
  • @NathanOliver I'm doing Monte Carlo runs with many sweeps. This is a minimum working example. – kηives Jul 02 '18 at 16:48
  • @MaxLanghof Ah. That makes sense. – NathanOliver Jul 02 '18 at 16:50
  • @kηives You need more precision on your output. With it you will see they are symmetrical: http://coliru.stacked-crooked.com/a/2ad62fa1c1c49018 – NathanOliver Jul 02 '18 at 16:51
  • [I was unable to reproduce using OnlineGDB's compiler](https://onlinegdb.com/H1O2PCwfX). This may reflect a bug in your compiler/library implementation. – Xirema Jul 02 '18 at 16:52

1 Answer


The problem appears to be floating point math. In particular, it seems likely that the distribution keeps a running total while normalizing its weights, causing it to lose the tiny probabilities at the end. As a simple example, assume that doubles could only store 2 significant digits (the reality is closer to 16) and your weights were 0.001, 1.0, 1.0 and 0.001:

It would sum the weights to 2.002 (which it can only represent as 2.0) and then normalize them. The first weight becomes 0.001/2.0 = 0.0005. The next one is 0.5, giving a running total of 0.5005 (which rounds to 0.50). The third weight is also 0.5, so the total is already 1.0. The difference to the allowed sum is 0.00, so it cannot give positive weight to the last event.

I am aware that this is not a perfect example (the weights still don't quite add up), but I hope you get the point: your standard library implementation and/or your floating point environment is losing the tiny weights to cancellation. Not that an event with probability 1e-20 would plausibly ever occur in 10000 draws, but you are right that the distribution should, in theory, keep the symmetry.

For those who say "there was not enough precision to print it": I disagree, because the values should ideally still be symmetric, and the first value is not printed as 0 while the last one is. Seeing as only the values smaller than one ULP around 1 are printed as zero, I stand by cancellation as the problem.

Max Langhof
    It's possible that internally, the implementation is storing only the cumulative probabilities (for use in a binary search on a uniform distributed number) and then `probabilities()` just constructs a vector of differences on the fly. So at the beginning the cumulative numbers are small enough to keep the precision, but at the end they're all very close to 1. – Daniel Schepler Jul 02 '18 at 22:27
  • @DanielSchepler That is a better reasoning. In that case, the returned probabilities will be exact (i.e. the last two events actually have probability zero). – Max Langhof Jul 03 '18 at 16:07