I have a data set (1D) link: dataset, which has values ranging from 21,000 to 8,000,000. When i plot histogram of the log values, i can see there are two peaks, roughly. I tried to fit Gaussian Mixture using sklearn package in Python. I tried to find best n_components based on lowest AIC/BIC. With Full covariance_type, best is is 44 with BIC, 98 with AIC ( i only tested up to 100). But once i use these numbers i got very poor fit. Also, i tested all other covariance_types, but i failed to fit to my data. I tried just 2, i got a much better fit.
import pandas as pd
import numpy as np
from sklearn.mixture import GaussianMixture as GMM
import matplotlib.pyplot as plt
df = pd.read_excel (r'Data_sets.xlsx',sheet_name="Set1")
b=df['b'].values.reshape(-1,1)
b=np.log(b)
####### finding best n_components ########
k= np.arange(1,100,1)
clfs= [GMM(n,covariance_type='full').fit(b) for n in k]
aics= [clf.aic(b) for clf in clfs]
bics= [clf.bic(b) for clf in clfs]
plt.plot(k,bics,color='orange',marker='.',label='BIC')
plt.plot(k,aics,color='g',label='AIC')
plt.legend()
plt.show()
And here is my attempt to plot histogram of my data + density pdf of the fitted Gaussian mixture
clf=GMM(38,covariance_type='full').fit(b)
n, bins, patches = plt.hist(b,bins='auto',density=True,color='#0504aa',alpha=0.7, rwidth=0.85)
xpdf=np.linspace(b.min(),b.max(),len(bins)).reshape(-1,1)
density= np.exp(clf.score_samples(xpdf))
plt.plot(xpdf,density,'-r')
print("Best number of K by BIC is", bics.index(min(bics)))
print("Best number of K by AIC is", aics.index(min(aics)))
here i ploted histograms with bins=50, top histogram is for the orginal data set =3915; the bottom one from 10,000 samples using n_components=44 as advised by BIC. it looks GMM(44) fits well.
My question, where is the mistake that leads to these results (1) Does it because of my data is not suitable to Gaussian mixture? (2) are my implementations wrong? I appreciate the help or suggestion to fix the issue. With the update (histogram plots), it looks like GMM fits the data well. However, i cannot see why the first plot hist+kde fit bad. I guess because both hist and kde use different y scale, but not sure. Thanks