I have 2D data from two classes and I am trying to compute the log odds:

ln(P(class=a|x)/P(class=b|x))

Then I want to plot the decision boundary, namely all the points where the log odds equal 0. I have done this for 1D data, but for 2D data my intuition is that I need a 2D histogram to estimate P(x), P(x | class = a), and P(x | class = b). Is this approach correct? Also, where do I get P(class = a)? Is it just 0.5 because there are two classes with equal numbers of samples? I also suspect that the way I plot the decision boundary is wrong, since the result is not what I expected.
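For reference, here is a sketch of how these quantities relate through Bayes' rule, using the same symbols as above. It assumes the priors are estimated from class frequencies, which is an assumption rather than something fixed by the problem:

```latex
\ln\frac{P(a \mid x)}{P(b \mid x)}
  = \ln\frac{P(x \mid a)\,P(a)/P(x)}{P(x \mid b)\,P(b)/P(x)}
  = \ln\frac{P(x \mid a)}{P(x \mid b)} + \ln\frac{P(a)}{P(b)}
```

P(x) cancels in the ratio, so only the two class-conditional terms and the priors matter. With equal sample counts and frequency-based priors, P(a) = P(b) = 0.5 and the prior term is zero.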

import numpy as np
import matplotlib.pyplot as plt

N = 1000

mean_a = [0, 0]
cov_a = [[2, 0], [0, 2]]  # diagonal covariance

mean_b = [1, 2]
cov_b = [[1, 0], [0, 1]]  # diagonal covariance

#generate data
Xa = np.random.multivariate_normal(mean_a, cov_a, N)
Xb = np.random.multivariate_normal(mean_b, cov_b, N)
Xall = np.vstack((Xa,Xb))

def logratio(a, b, eps=1e-14):
    # log of the ratio a / b (here: probability of class a vs. class b)
    a = a + eps  # to prevent taking logs of 0 or infinity
    b = b + eps  # to prevent taking logs of 0 or infinity
    return np.log(a / b)

P_a = 0.5 # since each class has equal number of samples
P_b = 0.5

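# use one set of bin edges for all three histograms so that the cells line up
(P_xn, x_bins, y_bins) = np.histogram2d(Xall[:, 0], Xall[:, 1])
(P_xn_if_a, _, _) = np.histogram2d(Xa[:, 0], Xa[:, 1], bins=(x_bins, y_bins))
(P_xn_if_b, _, _) = np.histogram2d(Xb[:, 0], Xb[:, 1], bins=(x_bins, y_bins))

# turn counts into per-cell probabilities
P_xn = P_xn / P_xn.sum()
P_xn_if_a = P_xn_if_a / P_xn_if_a.sum()
P_xn_if_b = P_xn_if_b / P_xn_if_b.sum()

P_a_if_xn = P_xn_if_a * P_a / (P_xn + 1e-16)
P_b_if_xn = P_xn_if_b * P_b / (P_xn + 1e-16)  # prior of class b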
log_odds = logratio(P_a_if_xn, P_b_if_xn)

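#plot only boundary: the contour where log_odds == 0
# histogram2d puts x along the first axis, while contour expects rows to
# index y, hence the transpose; bin centers are used as cell coordinates
x_centers = 0.5 * (x_bins[:-1] + x_bins[1:])
y_centers = 0.5 * (y_bins[:-1] + y_bins[1:])

fig, ax6 = plt.subplots(nrows=1, ncols=1, figsize=(15, 8))
ax6.scatter(Xa[:, 0], Xa[:, 1], color='r')
ax6.scatter(Xb[:, 0], Xb[:, 1], color='b')
ax6.contour(x_centers, y_centers, log_odds.T, levels=[0], colors='k')
plt.show()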

  • The log-odds for a two-class problem is some function of the input variables. When the per-class distributions are continuous functions, so is the log-odds. In the case of Gaussian distributions, the log-odds is equal to a quadratic form in the input variables plus a term not dependent on the inputs. This is easy to see: write down the log-odds, then expand and simplify it. For any kind of per-class distribution, just make a contour plot of the log-odds. The decision boundaries for different thresholds are the contours. No need for histograms. – Robert Dodier Nov 28 '18 at 19:14
  • Incidentally, since the log-odds for a two-class problem with Gaussian distributions is a quadratic form, the contours are conic sections; the kind of conic section is determined by the eigenvalues of the matrix in the quadratic form. – Robert Dodier Nov 28 '18 at 19:16
  • @RobertDodier Following your advice, this is what I get: https://imgur.com/a/gVP5UvJ . Does this look OK? Also, I was expecting a straight line to separate the 2 classes. –  Nov 28 '18 at 20:55
  • Well, that's not a contour plot of the log-odds for a 2-class problem with Gaussian distributions for each class, so that's not it. – Robert Dodier Nov 28 '18 at 20:59
  • @RobertDodier Oh, I forgot to say that I only want to plot the points x in R^2 whose log-odds is 0. –  Nov 28 '18 at 21:01
  • Well, in that case you want to display a specific contour. I don't know how to arrange that with the plotting package you are using. A plot of the form f(x, y) = c where c is a constant is also called an implicit plot -- perhaps that keyword will help you find resources. – Robert Dodier Nov 28 '18 at 21:04
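
The suggestion in the comments (evaluate the log-odds directly and draw a single contour) can be sketched as follows. This is a minimal illustration, not the commenter's own code: it assumes the true Gaussian parameters are known, and the helper names `gaussian_log_density`, `log_odds`, `log_odds_grid`, and `plot_zero_contour` are made up for this example.

```python
import numpy as np


def gaussian_log_density(points, mean, cov):
    """Log of the multivariate normal density at each point (last axis = coordinates)."""
    points = np.asarray(points, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = points - mean
    # Squared Mahalanobis distance, one value per point
    maha = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + logdet + mean.size * np.log(2.0 * np.pi))


def log_odds(points, mean_a, cov_a, mean_b, cov_b, prior_a=0.5):
    """ln P(a|x) - ln P(b|x); the evidence P(x) cancels, so it is never computed."""
    return (gaussian_log_density(points, mean_a, cov_a)
            - gaussian_log_density(points, mean_b, cov_b)
            + np.log(prior_a) - np.log(1.0 - prior_a))


def log_odds_grid(mean_a, cov_a, mean_b, cov_b, lo=-6.0, hi=8.0, n=200):
    """Evaluate the log-odds on an n-by-n grid; rows of the result index y."""
    xs = np.linspace(lo, hi, n)
    ys = np.linspace(lo, hi, n)
    gx, gy = np.meshgrid(xs, ys)
    pts = np.stack([gx, gy], axis=-1)
    return xs, ys, log_odds(pts, mean_a, cov_a, mean_b, cov_b)


def plot_zero_contour(mean_a, cov_a, mean_b, cov_b):
    """Draw only the log-odds = 0 contour, i.e. the decision boundary."""
    # Imported here so the numeric helpers above work without matplotlib
    import matplotlib.pyplot as plt
    xs, ys, z = log_odds_grid(mean_a, cov_a, mean_b, cov_b)
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.contour(xs, ys, z, levels=[0.0], colors="k")
    ax.set_aspect("equal")
    return fig, ax
```

Calling `plot_zero_contour([0, 0], [[2, 0], [0, 2]], [1, 2], [[1, 0], [0, 1]])` with the parameters from the question draws a circle rather than a straight line: expanding the log-odds gives (|x - (2, 4)|^2 - 10 - 4 ln 2) / 4, so the zero contour is a circle centered at (2, 4) with radius sqrt(10 + 4 ln 2) ≈ 3.57. A straight-line boundary would only appear if both classes had the same covariance matrix.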
