In my exploratory analysis, I am using a Gaussian Mixture Model to exclude outliers by drawing a contour plot for each phone model (4 unique models in total). I use two variables (r_max, b_max) to decide whether a point is an outlier.
This is my dataset (180 rows, 4 columns):
r_max | b_max | SPAD | model
255.0 | 46.0 | 35.1 | Redmi 5A
198.0 | 36.0 | 32.5 | Vivo 1820
237.0 | 77.0 | 35.8 | CPH1920
255.0 | 79.0 | 30.1 | SM-M105F
This is my code to create a contour plot:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from sklearn import mixture

for ph in final_df.model.unique():
    # keep only the two features for this phone model
    dataf = final_df[final_df.model == ph][['r_max', 'b_max']].values

    # fit a Gaussian mixture model (default: a single component)
    gmm = mixture.GaussianMixture()
    gmm.fit(dataf)

    # evaluate the log-likelihood on a 50x50 grid over [0, 300]
    X, Y = np.meshgrid(np.linspace(0, 300), np.linspace(0, 300))
    XX = np.array([X.ravel(), Y.ravel()]).T
    Z = gmm.score_samples(XX)
    Z = Z.reshape(X.shape)

    # contours of the negative log-likelihood, with the data points on top
    CS = plt.contour(X, Y, -Z, norm=LogNorm(vmax=500), levels=np.logspace(0, 3, 10))
    CB = plt.colorbar(CS, shrink=1.0, extend='both')
    plt.scatter(dataf[:, 0], dataf[:, 1], marker="x")
    plt.title(f"{ph} : r_max vs b_max")
    plt.show()
I can create the plots for all models except CPH1920. In this line:
CS = plt.contour(X, Y, -Z, norm=LogNorm(vmax=500), levels=np.logspace(0, 3, 10))
I get this error:
ValueError: min value must be less than or equal to max value
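One diagnostic that might help narrow this down is to compare the range of -Z on the grid with the fixed levels (1 to 1000) and vmax=500. The sketch below is only a check under assumptions: the helper name neg_log_likelihood_range is hypothetical, and the synthetic array stands in for the CPH1920 rows, which are not shown here. If the smallest -Z for a model is above 500, the fixed vmax and the data range no longer overlap, which may be related to the error, but this is a guess rather than a confirmed cause.

```python
import numpy as np
from sklearn import mixture


def neg_log_likelihood_range(points, grid_max=300, n=50):
    """Fit a single-component GMM and return (min, max) of -log-likelihood on the grid."""
    gmm = mixture.GaussianMixture(random_state=0)
    gmm.fit(points)
    X, Y = np.meshgrid(np.linspace(0, grid_max, n), np.linspace(0, grid_max, n))
    XX = np.array([X.ravel(), Y.ravel()]).T
    neg_z = -gmm.score_samples(XX)
    return float(neg_z.min()), float(neg_z.max())


# Synthetic stand-in data (NOT the real CPH1920 rows)
rng = np.random.default_rng(0)
demo = np.column_stack([rng.normal(237, 10, 45), rng.normal(77, 8, 45)])
lo, hi = neg_log_likelihood_range(demo)
print(f"-Z ranges from {lo:.2f} to {hi:.2f}; levels span 1 to 1000, vmax=500")
```

Running the same helper on the real per-model array (dataf in the loop) would show whether CPH1920 behaves differently from the other three models.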
Example of the output I get for Redmi 5A:
I'm not sure I am reading the plot correctly: are the points outside the purple contour the outliers?
My second question is how to extract the outlying data points. I would like to know which points are considered outliers so that I can exclude them from the dataset for EDA. Is there a way to calculate the probability of each point being an outlier and then apply a threshold, for example treating a point as an outlier if that probability is > 0.5?
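To make this question concrete, here is a minimal sketch of one possible approach, not an established recipe: score every point with score_samples (the per-sample log-likelihood under the fitted mixture) and flag the points whose score falls below a low percentile. The helper flag_outliers, the 5% cutoff, and the synthetic data are all assumptions for illustration. Note that this is a density threshold rather than a probability of being an outlier; predict_proba returns component membership probabilities, which are all 1 with the default single component.

```python
import numpy as np
from sklearn import mixture


def flag_outliers(points, percentile=5.0, random_state=0):
    """Return (boolean mask, log-likelihoods); True marks a low-density point."""
    gmm = mixture.GaussianMixture(random_state=random_state)
    gmm.fit(points)
    log_lik = gmm.score_samples(points)  # higher means more typical
    threshold = np.percentile(log_lik, percentile)  # hypothetical cutoff
    return log_lik < threshold, log_lik


# Synthetic stand-in data (NOT the real dataset): a cluster plus one distant point
rng = np.random.default_rng(0)
inliers = np.column_stack([rng.normal(230, 8, 44), rng.normal(60, 6, 44)])
demo = np.vstack([inliers, [[50.0, 250.0]]])
is_outlier, scores = flag_outliers(demo)
print(demo[is_outlier])  # rows flagged as low-density
```

Inside the per-model loop, the same mask could be computed from dataf and used to drop the flagged rows for that model. A percentile cutoff always flags a fixed share of points, so the cutoff itself is a judgment call.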
Hoping someone can lend some help here, any help is appreciated!