
I'm trying to calculate the probability of a bivariate normal distribution over a specific area, more precisely a specific polygon, in Java.

Mathematically, this means integrating the probability density function (pdf) of the bivariate normal distribution over a specific, complex area.

My first approach was to use two NormalDistribution objects from the apache-commons-math library. Given a dataset x for dimension 1 and a dataset y for dimension 2, I computed the mean and standard deviation for each NormalDistribution.

With the method public double probability(double x0, double x1) from org.apache.commons.math3.distribution.NormalDistribution I'm able to set an individual interval for each dimension, which means I can define a rectangular area and get the probability by

NormalDistribution normalX = new NormalDistribution(means[0], stdDeviation_x);
NormalDistribution normalY = new NormalDistribution(means[1], stdDeviation_y);

double probabilityOfRect = normalX.probability(x1, x2) * normalY.probability(y1, y2);

If the standard deviations are small enough and the defined region is large enough, the probability approaches 1.0 (e.g. 0.99999999999), which is expected.

But, as I said, I need to compute the probability over a specific area, so this first approach won't work: it only lets me define rectangular regions.

So my second approach was to use the class MultivariateNormalDistribution, which is also implemented in apache-commons-math.

By constructing the MultivariateNormalDistribution with the vector of means and the covariance matrix, I can evaluate the pdf at a specific point x with public double density(double[] vals), as the documentation says:

Returns the probability density function (PDF) of this distribution evaluated at the specified point x.

http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/distribution/MultivariateNormalDistribution.html#density(double[])

In this approach I convert my complex area into an ArrayList of Points and then sum up all the densities by iterating over the ArrayList like this:

MultivariateNormalDistribution mnd = new MultivariateNormalDistribution(means, covariances);
double sum = 0.0;
for (Point p : complexArea) {
    double[] pos = {p.x, p.y};
    sum += mnd.density(pos);
}
return sum;

But I've run into a precision problem when the standard deviations are set to really low values: the pdf then contains peaks > 1 at the positions where I call mnd.density(pos), so the sum adds up to values > 1.

To avoid these peaks, for each grid point I instead average the densities over a sub-grid of surrounding points (in steps of 0.1 within the unit square around the current point):

MultivariateNormalDistribution mnd = new MultivariateNormalDistribution(means, covariances);
double sum = 0.0;
for (Point p : surfacePoints) {
    double tmpRes = 0.0;

    // sample the density on a 10 x 10 sub-grid of the 1x1 cell around p
    // (integer counters avoid the rounding drift of repeatedly adding 0.1)
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            double[] pos = {p.x - 0.5 + i * 0.1, p.y - 0.5 + j * 0.1};
            tmpRes += mnd.density(pos);
        }
    }
    // dividing by 100 is the same as multiplying by the sub-cell area 0.1 * 0.1
    sum += tmpRes / 100.0;
}

return sum;

which seems to work.

All in all, I'm not quite sure whether my approaches are fundamentally correct. Another option would be to compute the probability with numerical integration, but I have no idea how to do this in Java.

Are there any other possibilities to achieve this?

EDIT: Apart from the lack of accuracy, the main question is: is the second approach of "summing up the densities" a valid method to obtain the probability of an area under a bivariate normal distribution? Thinking about one-dimensional normal distributions, the probability of any single point is always 0. How does the public double density(double[] vals) method in the Apache math library obtain a valid value?

horge
  • How did you select the points you sum on (in your arraylist) ? What kind of geometric structure are you considering ? And do you expect the result to be always close to 1 ? – vib Apr 29 '15 at 12:19
  • The points are grabbed from a canvas where a polygon can be drawn. The ArrayList is then filled up by simply overlaying a grid with int precision over the whole polygon and checking whether a grid point is contained in the polygon or not. The geometric structure is any kind of polygon except nonsimple polygons. The result is not expected to be always close to 1. – horge Apr 29 '15 at 13:11
  • I'm surprised that you are getting problems only with large standard deviations. Usually, smaller standard deviations cause greater problems. I suspect that you are confusing the standard deviation with its inverse, since you do divide by the standard deviation in some formulas. What you are doing is essentially numerical integration by a very coarse method, using 1x1 squares, and if that's not good enough, you should use one of the more accurate methods of numerical integration. – Douglas Zare Apr 29 '15 at 13:13
  • Have you tried, instead of grabbing points, to grab rectangles contained in your polygon, to exactly evaluate the contribution of the gaussian within this rectangle, and then to sum it up ? – vib Apr 29 '15 at 13:19
  • @DouglasZare: I'm getting problems with small standard deviations, as I wrote "when setting the standard deviations to really low values [...]" which is causing problems like a probability greater than 1. – horge Apr 29 '15 at 13:30
  • Oh, so when you say, "If the standard deviations are small enough and the defined region is large enough, the probability will approach to a number of 1.0 (0.99999999999)" you mean that is the theoretical result, not what you observe in your calculations? It is strange that you remind us that 1.0 is close to 0.99999999999; you should not expect people who don't know or remember this to be able to help you integrate. When you mention 0.99999999999 I expect you to say that this is the output your algorithm produces under favorable circumstances, which would not be with small standard deviations. – Douglas Zare Apr 29 '15 at 13:37
  • I'd wonder if another integration scheme (e.g. Gauss quadrature) would be a better idea here. I'd do a quick study to see what order quadrature would be sufficient. I'd also wonder if symmetry can be exploited. – duffymo Apr 29 '15 at 13:42
  • @DouglasZare: thanks for pointing that out. I was testing my calculations by setting up a bnd with very small standard deviations and a huge area so that the normal distribution was completely included in the area. The output was 0.999... and I justified it with the theoretical result. – horge Apr 29 '15 at 13:55
  • You could also generate numbers from the Gaussian distribution and check whether they belong to your polygon or not ... probably not the fastest but easy to code (a sketch follows these comments). See for instance http://mathfaculty.fullerton.edu/mathews/n2003/montecarlopimod.html – vib Apr 29 '15 at 14:07
  • When you say you encounter problems with “really low” standard deviations, how low is that? A multivariate normal only has one peak, and it is allowed to be greater than 1. My initial thought is that you are forgetting to multiply by `dx dy` (assuming your points come from a rectangular grid). – Teepeemm Apr 29 '15 at 14:25
  • @Teepeemm: It's not stated directly in the problem but the OP is sampling only at points with integer coordinates. So, dx dy = 1. And it's a very coarse numerical integration technique that could be improved in many ways. – Douglas Zare Apr 29 '15 at 14:38
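
A minimal sketch of the Monte Carlo idea mentioned above, assuming a java.awt.Polygon (or any other point-in-polygon test) and the MultivariateNormalDistribution already fitted in the question; it is easy to code, but the estimate converges slowly with the number of samples:

// Estimate P((X, Y) in polygon) by drawing samples from the fitted
// bivariate normal and counting how many land inside the polygon.
static double monteCarloEstimate(MultivariateNormalDistribution mnd,
                                 java.awt.Polygon polygon, int samples) {
    int inside = 0;
    for (int i = 0; i < samples; i++) {
        double[] s = mnd.sample();        // one random (x, y) draw
        if (polygon.contains(s[0], s[1])) {
            inside++;
        }
    }
    return (double) inside / samples;     // fraction inside ≈ probability
}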

2 Answers


Your current approach is to perform a numerical integral by sampling at points with integer coordinates, assigning the value at each point to the whole square. This has two main sources of error. One is that the function may vary a lot within the square. Another is the boundary, where you integrate over squares not completely contained in the region. A third source of error is roundoff, but this will rarely be significant since the other sources of error are huge.

One simple way to reduce the error is to use a finer grid. If you sample at points with coordinates that are integers divided by n (and multiply by the area n^-2 of the 1/n by 1/n squares), this will reduce both sources of error. A problem is that you sample at about n^2 times as many points.
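
A minimal sketch of that refinement, reusing the means, covariances and the integer-grid complexArea from the question; each 1x1 cell is split into n x n sub-cells and every density sample is weighted by the sub-cell area 1/n^2:

MultivariateNormalDistribution mnd = new MultivariateNormalDistribution(means, covariances);
int n = 10;                          // subdivisions per unit length
double subCellArea = 1.0 / (n * n);  // area of one 1/n-by-1/n sub-cell
double sum = 0.0;
for (Point p : complexArea) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            // sample at the centre of each sub-cell inside the 1x1 cell around p
            double x = p.x - 0.5 + (i + 0.5) / (double) n;
            double y = p.y - 0.5 + (j + 0.5) / (double) n;
            sum += mnd.density(new double[] {x, y}) * subCellArea;
        }
    }
}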

I suggest writing your double integral over the region as an integral of integrals.

The inner integral (say, with respect to x) will be an integral of a one-dimensional Gaussian over an interval, if the region is convex, or at worst over a finite list of intervals. You integrate the pdf restricted to a particular y coordinate y0 along the intersection of the polygon with the horizontal line y=y0. You can evaluate the inner integrals using functions such as erf, which is numerically approximated in libraries, or you can do it yourself using a one-dimensional numerical integral.
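
For example, if the covariance matrix happens to be diagonal (independent X and Y, as in the question's first approach), the inner integral reduces to the marginal density in y times a difference of 1-D CDF values; a hypothetical helper using the question's NormalDistribution objects might look like this (for a general covariance you would first rotate into the principal axes):

// Inner integral for a fixed y0: integral over [a, b] of f(x, y0) dx,
// assuming a diagonal covariance so that f(x, y0) = pdfX(x) * pdfY(y0).
static double innerIntegral(NormalDistribution normalX, NormalDistribution normalY,
                            double y0, double a, double b) {
    return normalY.density(y0)
         * (normalX.cumulativeProbability(b) - normalX.cumulativeProbability(a));
}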

The outer integral (say, with respect to y) naturally breaks up into pieces. Where there is a vertex of the polygon, the function inside the outer integral might not be smooth. So, break up the outer integral at the y-coordinates of the vertices of the polygon, and use a numerical integration rule such as the trapezoid rule or Simpson's rule on each of the intervals. These require you to evaluate the inner integral at a few points in each interval and weight them appropriately.
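
A minimal Simpson's-rule sketch for one such piece, assuming the caller supplies g(y), the value of the inner integral along the horizontal line at height y (for instance built from the hypothetical innerIntegral above plus the polygon's x-interval at that height):

import java.util.function.DoubleUnaryOperator;

// Simpson's rule on one piece [yLow, yHigh] of the outer integral,
// using an even number m of subintervals; g(y) is the inner integral at height y.
static double simpson(DoubleUnaryOperator g, double yLow, double yHigh, int m) {
    double h = (yHigh - yLow) / m;
    double s = g.applyAsDouble(yLow) + g.applyAsDouble(yHigh);
    for (int k = 1; k < m; k++) {
        s += (k % 2 == 1 ? 4.0 : 2.0) * g.applyAsDouble(yLow + k * h);
    }
    return s * h / 3.0;
}

Summing this over the intervals between consecutive vertex y-coordinates approximates the full double integral.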

This should produce much more accurate results for a given amount of time than simply refining the grid.

Douglas Zare
  • I'm generally in agreement, but from OP's description, it sounds like the joint density is just the product of marginal densities, in which case only 1-d integrals (computed via erf) are needed. – Robert Dodier Apr 29 '15 at 18:01
  • @Robert Dodier: I'm not sure if the covariance matrix is really diagonal or whether that part was a mistake. However, if the polygon is not an axis-aligned rectangle, how are you getting the double integral directly from the 1-dimensional CDFs? – Douglas Zare Apr 29 '15 at 18:26
  • Yes, you're right. Still, I'd be surprised if this weren't a problem with a known solution, although I wouldn't be surprised if numerical integration is simpler than an exact solution. – Robert Dodier Apr 29 '15 at 19:11
  • I've looked at the problem before, and I think it is the exception rather than the rule to have a simple expression for the double integral. There are published articles on approximating the double integral for particular simple shapes. – Douglas Zare Apr 29 '15 at 19:56
  • Thanks for your advice and your description. I've edited my question to point out the main questions I'm struggling with. – horge Apr 30 '15 at 09:03
  • @horge: You are trying to draw my attention to the question "Is this a valid method at all?" As I stated several times, when you sum up the densities, you are doing a very coarse numerical integral where you approximate the value of the function on a 1x1 square by the value at one point of the square. This is like a Riemann sum in 2 dimensions. So, there are no problems with the units, it is just inaccurate. Normally, you take smaller subdivisions to increase the accuracy, and I explained how to do that where I mentioned 1/n by 1/n squares. The method I described is more efficient than that. – Douglas Zare Apr 30 '15 at 09:59
  • The 1-dimensional analogue is that we can approximate the integral of f(x) on, say, [0,3] as f(0) + f(1) + f(2). It would be more accurate to use 1/2 (f(0) + f(1/2) + f(1) + f(3/2) + f(2) + f(5/2)), or 1/10 (f(0) + f(1/10) + ... + f(2.9)). The first sum uses intervals of length 1, so you multiply the value of the function at each point by 1. – Douglas Zare Apr 30 '15 at 17:49

See:

Didonato, A. R., Jarnagin, Jr., M. P., & Hageman, R. K. (1980). Computation of the Integral of the Bivariate Normal Distribution Over Convex Polygons. SIAM Journal on Scientific and Statistical Computing, 1(2), 179–186. doi:10.1137/0901010

(If your polygon is not convex, there is another paper in the same issue that handles the general case.)

nth