9

I need to calculate the area where two functions overlap. I use normal distributions in this particular simplified example, but I need a more general procedure that adapts to other functions too.

See image below to get an idea of what I mean, where the red area is what I'm after:

enter image description here

This is the MWE I have so far:

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Generate random data uniformly distributed.
a = np.random.normal(1., 0.1, 1000)
b = np.random.normal(1., 0.1, 1000)

# Obtain KDE estimates foe each set of data.
xmin, xmax = -1., 2.
x_pts = np.mgrid[xmin:xmax:1000j]
# Kernels.
ker_a = stats.gaussian_kde(a)
ker_b = stats.gaussian_kde(b)
# KDEs for plotting.
kde_a = np.reshape(ker_a(x_pts).T, x_pts.shape)
kde_b = np.reshape(ker_b(x_pts).T, x_pts.shape)


# Random sample from a KDE distribution.
sample = ker_a.resample(size=1000)

# Compute the points below which to integrate.
iso = ker_b(sample)

# Filter the sample.
insample = ker_a(sample) < iso

# As per Monte Carlo, the integral is equivalent to the
# probability of drawing a point that gets through the
# filter.
integral = insample.sum() / float(insample.shape[0])

print integral

plt.xlim(0.4,1.9)
plt.plot(x_pts, kde_a)
plt.plot(x_pts, kde_b)

plt.show()

where I apply Monte Carlo to obtain the integral.

The problem with this method is that when I evaluate sampled points in either distribution with ker_b(sample) (or ker_a(sample)), I get values placed directly over the KDE line. Because of this, even clearly overlapped distributions which should return a common/overlapped area value very close to 1. return instead small values (the total area of either curve is 1. since they are probability density estimates).

How could I fix this code to give the expected results?


This is how I applied Zhenya's answer

# Calculate overlap between the two KDEs.
def y_pts(pt):
    y_pt = min(ker_a(pt), ker_b(pt))
    return y_pt
# Store overlap value.
overlap = quad(y_pts, -1., 2.) 
Gabriel
  • 40,504
  • 73
  • 230
  • 404
  • 1
    http://stackoverflow.com/questions/15361125/calculate-area-between-two-curves-that-are-normal-distributions/15361352#15361352 – ev-br Dec 04 '13 at 17:42
  • I'm going over your answer at the question you linked and it appears to be applicable here too even though I initially assumed it would only work for normal distributions. Would you mind posting your comment in the form of an answer? That way if it does work, I can mark it as accepted. Thank you. – Gabriel Dec 04 '13 at 18:57
  • That answer uses quadrature - is that an option here? If Monte Carlo is necessary, then the code above needs a few changes. I wish I understood your ending comments - the sentence starting with "I get values placed directly over the KDE..." is msterious to me. – Charles Pehlivanian Dec 04 '13 at 18:57
  • Hi @CharlesPehlivanian, what I mean about "directly over" is that evaluating a point in a kernel (`ker_a` for example) returns the value of the kernel just as you would get with any other function. For example, f(x) = x^2 returns a value placed over the quadratic curve for any given x, and since I wanted to apply Monte Carlo, I needed them randomly distributed _below_ that curve. In any case this appears to be an overly complicated way to go about this. I'll update the question to reflect this after/if Zhenya posts his answer. – Gabriel Dec 04 '13 at 19:02
  • I think I've found a pretty simple answer, linked here: – Karop Feb 02 '18 at 12:10
  • I think I found a simple answer in R, linked here: – Karop Feb 02 '18 at 12:12

2 Answers2

10

The red area on the plot is the integral of min(f(x), g(x)), where f and g are your two functions, green and blue. To evaluate the integral, you can use any of the integrators from scipy.integrate (quad's the default one, I'd say) -- or an MC integrator, of course, but I don't quite see the point of that.

ev-br
  • 24,968
  • 9
  • 65
  • 78
  • The `min(f(x), g(x))` produces a `ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`. I am not sure how to address it. Should it be `min(f(x).any(), g(x).any())`? But then it produces `TypeError: 'numpy.bool_' object is not iterable` – durbachit Mar 16 '17 at 03:04
  • `min(f(x), g(x))` is hand wavy. How do you compare functions over a certain interval? Do you get a sequence of ordered x's which are very close to their neighbors and compute f(x)'s to compare? Should we use those f(x)'s to compute approximated integral or `quad`? – Xiaohong Deng May 30 '19 at 21:55
0

I think another solution would be to multiply the two curves, then take the integral. You may want to do some sort of normalization. The analogy is orbital overlap in chemistry: https://en.wikipedia.org/wiki/Orbital_overlap

wordsforthewise
  • 13,746
  • 5
  • 87
  • 117