BIC curve after GMM clustering

I want to use the BIC to find the optimal number of clusters for GMM clustering. I plotted the BIC scores for cluster counts from 2 to 41 (in steps of 3) and got the attached curve. I have no idea how to interpret it. Can someone help?

For reference, this is the code I used to do GMM clustering. It is applied to daily wind vector data over a region, totaling approximately 5,500 columns and 13,880 rows.

import numpy as np
import pandas
from sklearn.mixture import GaussianMixture


def gmm_clusters(df_std, dates):
    # Candidate numbers of components: 2, 5, 8, ..., 41
    ks = range(2, 44, 3)
    bic_scores = []
    csv_files = []
    for k in ks:
        model = GaussianMixture(n_components=k,
                                n_init=1,
                                init_params='random',
                                covariance_type='full',
                                verbose=0,
                                random_state=123)
        fitted_model = model.fit(df_std)
        # BIC of the fitted mixture on the training data (lower is better)
        bic_score = fitted_model.bic(df_std)
        bic_scores.append(bic_score)
        # Hard cluster assignment for every row
        labels = fitted_model.predict(df_std)
        print("Labels counts")
        print(np.bincount(labels))
        df_label = pandas.DataFrame(df_std)
        print("############ dataframe AFTER CLUSTERING ###############")
        # Attach the dates and the cluster label to each row
        df_dates = pandas.DataFrame(dates)
        df_dates.columns = ['Date']
        df_dates = df_dates.reset_index(drop=True)
        df_label = df_label.join(df_dates)
        df_label["Cluster"] = labels
        print(df_label)
        # Write one CSV file per value of k
        csv_file = "{0}_GMM_2_Countries_850hPa.csv".format(k)
        df_label.to_csv(csv_file)
        csv_files.append(csv_file)

    return ks, bic_scores, csv_files

Thank you!!

EDIT: Using K-means on the same data, I get this elbow plot (plot of SSE): [image: K-means elbow plot] This is fairly clear to interpret, indicating that 11 clusters is the optimum.

1 Answer

The first thing that springs to mind is to check cluster counts below 10 with a step of 1, not 3. There may be a dip in the BIC that you are missing there.

The second thing is to compare AIC with BIC. See here: https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other
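A minimal sketch of the first two suggestions, assuming scikit-learn's `GaussianMixture` as in the question. The helper name `scan_information_criteria` and the toy data are made up for illustration; on the real data you would pass `df_std` instead of `X_toy`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def scan_information_criteria(X, ks, covariance_type="full", random_state=123):
    """Fit one GMM per k and return (bic, aic) lists aligned with ks."""
    bic, aic = [], []
    for k in ks:
        gmm = GaussianMixture(n_components=k,
                              covariance_type=covariance_type,
                              random_state=random_state).fit(X)
        bic.append(gmm.bic(X))  # lower is better
        aic.append(gmm.aic(X))  # lower is better
    return bic, aic


# Toy data (an assumption, not the wind data): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

ks = list(range(1, 11))  # step of 1 instead of 3
bic, aic = scan_information_criteria(X_toy, ks)
print("k with lowest BIC:", ks[int(np.argmin(bic))])
print("k with lowest AIC:", ks[int(np.argmin(aic))])
```

Plotting `bic` and `aic` against `ks` on the same axes makes it easy to see whether the two criteria agree on a minimum.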

The third thing is that your dataset has 5,500 dimensions but only 13,880 points, which is fewer than 3 points per dimension. I would be surprised to find any clustering at all with so few points per dimension, and that is what the BIC chart is indicating. You would need to say more about the data, what each column means, and what kind of clustering you are looking for.
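To put a rough number on this, here is a small sketch that counts the free parameters of a Gaussian mixture with `covariance_type='full'` (means, one symmetric covariance matrix per component, and mixing weights) and the resulting BIC penalty term, parameters times log(n). The function name `gmm_free_parameters` is made up for illustration; the shapes are the ones quoted in the question.

```python
import math


def gmm_free_parameters(n_components, n_features):
    """Free parameters of a full-covariance Gaussian mixture."""
    mean_params = n_components * n_features
    # Each symmetric d x d covariance matrix has d * (d + 1) / 2 free entries.
    cov_params = n_components * n_features * (n_features + 1) // 2
    # Mixing weights sum to 1, so one of them is determined by the others.
    weight_params = n_components - 1
    return mean_params + cov_params + weight_params


n_samples, n_features = 13_880, 5_500  # shapes quoted in the question
for k in (2, 5, 11):
    p = gmm_free_parameters(k, n_features)
    penalty = p * math.log(n_samples)  # BIC penalty term
    print(f"k={k}: {p:,} parameters, BIC penalty {penalty:.3e}")
```

Each full-covariance component adds roughly 15 million parameters here, so the penalty term grows very quickly with k. A more restrictive `covariance_type` such as `'diag'` would shrink the count a lot, though whether that suits this data is a separate question.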

Evgeny Tanhilevich
  • Thanks, I will also try fewer than 10 clusters with a step of 1. I had also tried clustering with K-means and got the results above (see the edit), so there is evidence of clustering. Unfortunately, it's not possible to reduce the dimensions of the dataset: each column corresponds to a lat-long location within my domain, and I have to include them all. – Mridula Gunturi Oct 14 '21 at 15:33
  • Also, is there any way to find SSE using GMM, as I did with K-means? Thank you! – Mridula Gunturi Oct 14 '21 at 15:34
  • For SSE, you could compute inertia manually for GMMs, as defined here: https://scikit-learn.org/stable/modules/clustering.html#inertia – Evgeny Tanhilevich Oct 14 '21 at 15:42
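A minimal sketch of the manual inertia computation mentioned in the last comment, assuming hard labels from `predict` and the fitted component means. The function name `gmm_inertia` is made up for illustration. Note that this measures the squared distance to the mean of the assigned component, which for a GMM is not necessarily the nearest mean.

```python
import numpy as np


def gmm_inertia(X, labels, means):
    """Sum of squared distances from each row of X to its assigned mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    means = np.asarray(means, dtype=float)
    diffs = X - means[labels]  # per-row offset from the assigned mean
    return float(np.sum(diffs ** 2))


# Tiny hand-checkable example: two points around (1, 0), one exactly on (10, 10).
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
labels_demo = np.array([0, 0, 1])
means_demo = np.array([[1.0, 0.0], [10.0, 10.0]])
print(gmm_inertia(X_demo, labels_demo, means_demo))
```

With the code from the question, the call would look like `gmm_inertia(df_std, fitted_model.predict(df_std), fitted_model.means_)`.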