Two classes dataset

This is a synthetic classification data set with two classes, shown in red and blue. The blue class is generated from a single Gaussian, while the red class comes from a mixture of two Gaussians.

Since we know the prior probabilities (p(C0)=0.5 and p(C1)=0.5) and the class-conditional densities (a single Gaussian p(x|C0) and a mixture of two Gaussians p(x|C1)), we can calculate the true posterior probabilities and plot their contour lines and filled contours, as shown on the right. But how can I plot the minimum misclassification-rate decision boundary (the green line)?
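
For reference, the posterior mentioned above follows from Bayes' theorem. This is a sketch under one assumption: the two mixture components of the red class are weighted 1/2 each, which the 50/50 sample split in create_toy_data suggests but the post does not state explicitly.

```latex
\begin{align}
p(x \mid C_0) &= \mathcal{N}(x \mid \mu_1, \Sigma_1), \\
p(x \mid C_1) &= \tfrac{1}{2}\,\mathcal{N}(x \mid \mu_2, \Sigma_2)
               + \tfrac{1}{2}\,\mathcal{N}(x \mid \mu_3, \Sigma_3), \\
p(C_1 \mid x) &= \frac{p(x \mid C_1)\,p(C_1)}
                      {p(x \mid C_0)\,p(C_0) + p(x \mid C_1)\,p(C_1)}.
\end{align}
```

With equal priors, the condition p(C1|x)=0.5 reduces to p(x|C0)=p(x|C1), so the boundary is the set of points where the two class-conditional densities are equal.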

The data is generated as follows:

import numpy as np
import matplotlib.pyplot as plt

def create_toy_data(mu1, mu2, mu3, sigma1, sigma2, sigma3):
    # Class 0 (blue): 100 points from a single Gaussian
    x0 = np.random.multivariate_normal(mu1, sigma1, 100)
    # Class 1 (red): 50 points from each of two Gaussians
    x1 = np.random.multivariate_normal(mu2, sigma2, 50)
    x2 = np.random.multivariate_normal(mu3, sigma3, 50)
    X = np.concatenate([x0, x1, x2])
    y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
    return X, y

I know the minimum misclassification-rate decision boundary is the set of points where p(C0|x)=p(C1|x)=0.5, but how can I represent that curve explicitly?

merv
Charles
  • Are you looking for the functional form of that specific decision boundary, or how to get an approximation by building a machine learning model? – Dimosthenis Oct 31 '18 at 19:08
  • In a general sense, it appears what you want is to plot the implicit function p(C1|x) = 0.5 (or equivalently p(C0|x) = 0.5). Given the location and shape parameters for the Gaussian blobs, you can construct a function which returns p(C1|x) for any x = (x1, x2) where x1, x2 are the two dimensions of the input space. You would want to plot the implicit function p(C1|(x1, x2)) = 0.5 over the input space. A brief web search suggests Matplotlib isn't best for that; someone suggested Sympy (http://sympy.org). Good luck and have fun. – Robert Dodier Oct 31 '18 at 22:43
  • @Dimosthenis The former. Is it possible to plot such a decision boundary through an explicit function when all the related probabilities are known? – Charles Oct 31 '18 at 23:54
  • @RobertDodier You got it. I'd also like to know whether the implicit function can be represented explicitly or not. – Charles Nov 01 '18 at 00:00
  • When there's just one Gaussian bump for each class, the decision boundary is a conic section. With more than one bump per class, I don't think there is any simple characterization. – Robert Dodier Nov 01 '18 at 00:22
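
The implicit-function idea from the comments can be sketched numerically: evaluate p(C1|x) on a grid and ask Matplotlib to draw only the 0.5 level. This is a minimal sketch, not a definitive implementation. The helper names (gaussian_pdf, posterior_c1, plot_boundary), the means and covariances in PARAMS, and the equal mixture weights are all assumptions made for illustration; the post does not give the actual parameters.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) evaluated at each row of x (shape (n, d))."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = np.atleast_2d(x) - mu
    # Squared Mahalanobis distance of every row from the mean
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return np.exp(-0.5 * maha) / norm

def posterior_c1(x, mu1, mu2, mu3, sigma1, sigma2, sigma3,
                 prior0=0.5, prior1=0.5):
    """p(C1 | x) by Bayes' theorem.

    C0 is a single Gaussian; C1 is a two-component mixture whose
    weights are ASSUMED to be 0.5 each.
    """
    joint0 = prior0 * gaussian_pdf(x, mu1, sigma1)
    joint1 = prior1 * (0.5 * gaussian_pdf(x, mu2, sigma2)
                       + 0.5 * gaussian_pdf(x, mu3, sigma3))
    return joint1 / (joint0 + joint1)

# Hypothetical parameters, chosen only so the example runs
PARAMS = ([0.0, 0.0], [-3.0, 0.0], [3.0, 0.0],
          np.eye(2), np.eye(2), np.eye(2))

def plot_boundary(params, xlim=(-6, 6), ylim=(-6, 6), n=300):
    """Filled posterior contours plus the p(C1|x) = 0.5 curve in green."""
    # Imported here so the posterior helpers work without Matplotlib
    import matplotlib.pyplot as plt
    X, Y = np.meshgrid(np.linspace(*xlim, n), np.linspace(*ylim, n))
    grid = np.column_stack([X.ravel(), Y.ravel()])
    Z = posterior_c1(grid, *params).reshape(X.shape)
    fig, ax = plt.subplots()
    ax.contourf(X, Y, Z, levels=20, cmap="RdBu_r", alpha=0.6)
    # Drawing a single level traces the implicit curve p(C1|x) = 0.5
    boundary = ax.contour(X, Y, Z, levels=[0.5], colors="green", linewidths=2)
    return fig, ax, boundary
```

Calling plot_boundary(PARAMS) and then plt.show() should display the boundary. The curve is still only implicit: the returned contour set holds the traced polyline vertices (its allsegs attribute), which is a numerical approximation rather than a closed-form expression.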

0 Answers