In my exploratory analysis, I am using a Gaussian Mixture Model (GMM) to exclude outliers, drawing a contour plot for each phone model (4 unique models in total). I use two variables (r_max, b_max) to decide whether a point is an outlier.

This is my dataset (180 rows, 4 columns):

r_max | b_max | SPAD | model
255.0 | 46.0  | 35.1   | Redmi 5A
198.0 | 36.0  | 32.5   | Vivo 1820
237.0 | 77.0  | 35.8   | CPH1920
255.0 | 79.0  | 30.1   | SM-M105F

This is my code to create a contour plot:

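import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import mixture

for ph in final_df.model.unique():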

    dataf = final_df[final_df.model==ph]
    
    dataf = dataf[['r_max',  'b_max', 'SPAD', 'model']]

    dataf = dataf[['r_max', 'b_max']].values
    #dataf = np.array(dataf).reshape(-1,1)

    # fit gaussian mixture model 
    gmm = mixture.GaussianMixture()
    gmm.fit(dataf)

    X, Y = np.meshgrid(np.linspace(0, 300), np.linspace(0, 300))
    XX = np.array([X.ravel(), Y.ravel()]).T
    Z = gmm.score_samples(XX)
    Z = Z.reshape(X.shape)
    CS = plt.contour(X, Y, -Z, norm=LogNorm(vmax=500), levels=np.logspace(0, 3, 10))
    CB = plt.colorbar(CS, shrink= 1.0, extend = 'both')
    plt.scatter(dataf[:,0], dataf[:,1], marker = "x", cmap='viridis')
    plt.title(f"{ph} : r_max vs b_max")
    #ax.set_title(f"{feats[idx]} vs SPAD")
    plt.show()

I can create the plots for all models except CPH1920. In this line:

CS = plt.contour(X, Y, -Z, norm=LogNorm(vmax=500), levels=np.logspace(0, 3, 10))

I get this error:

ValueError: min value must be less than or equal to max value

Example of the output I get for Redmi 5A: (image not available)

I'm not sure I am understanding it correctly, but the points outside the purple circle are outliers, right?

Next question: how do I extract which data points are the outliers? I would like to know which points are considered outliers so I can exclude them from my dataset for EDA. Is there some way to calculate the probability of each point being an outlier and then apply a threshold, e.g. treat a point as an outlier if that probability is > 0.5?

Hoping someone can lend some help here, any help is appreciated!

  • Probably best to focus on one question at a time. If you're trying to understand why you get an error for CPH1920, maybe one thing would be to look at summary stats for its r_max and b_max values, and for the Z values. – TMBailey Aug 18 '21 at 07:32
  • You also need to add tags to your question based on what you're using. i.e. pandas, numpy, matplotlib, etc. – martineau Aug 18 '21 at 08:10
  • Thanks so much for the advice! However, my main question would be how do I extract which data points are classified as outliers according to GMM? I need to know what points are these so I can exclude them in my exploratory analysis. – user16285826 Aug 19 '21 at 06:11

1 Answer

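As requested in the comments, I'll only answer the question of how to exclude the outliers. Your contour-plot code already uses everything you need. The function gmm.score_samples takes an array of data points and returns the log-likelihood of each one under the GMM (the log of the probability density, not a probability between 0 and 1). You can threshold these scores much as you suggested, but the cutoff is then a log-density value, so a fixed 0.5 has no special meaning. A low percentile of the scores is one reasonable starting point:

log_probs = gmm.score_samples(dataf)
threshold = np.percentile(log_probs, 5)  # example cutoff: lowest 5% of scores
not_outliers = dataf[log_probs > threshold]

The array not_outliers should now contain only the data points above your threshold. You will have to experiment a bit with how high to set it.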
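To make the thresholding step concrete, here is a minimal, self-contained sketch. It runs on synthetic data rather than the dataset from the question, and the names points, planted, log_density and the 5% cutoff are illustrative assumptions, not something the question specifies. Keep in mind that score_samples returns log-densities, so the cutoff is chosen relative to the scores themselves rather than as a probability.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for one phone model's (r_max, b_max) values.
# This is NOT the asker's data; it only makes the example runnable.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=[230.0, 60.0], scale=[10.0, 8.0], size=(45, 2))
planted = np.array([[120.0, 150.0], [290.0, 5.0]])  # two deliberately distant points
points = np.vstack([inliers, planted])

# Single-component model, matching the default GaussianMixture() in the question.
gmm = GaussianMixture(n_components=1, random_state=0).fit(points)

# Log of the fitted density at each point; values can be any real number.
log_density = gmm.score_samples(points)

# Assumed rule: treat the lowest 5% of log-densities as outliers.
cutoff = np.percentile(log_density, 5)
is_outlier = log_density < cutoff

outlier_points = points[is_outlier]   # rows to inspect or drop
kept_points = points[~is_outlier]     # rows to keep for EDA
```

With a DataFrame filtered to one model, the same boolean mask can index the rows directly (for example df_model[~is_outlier], where df_model is a hypothetical per-model frame), so the SPAD and model columns stay attached. Note that a percentile cutoff always flags a fixed fraction of points, whether or not they are genuinely unusual, so it is worth checking the flagged rows against the contour plot.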

tilman151