I have 100 clusters, each with a mean and standard deviation value. These clusters are predefined using the SPSS software package, by using the 2-step cluster method. Therefore, the optimisation of these cluster distributions to fit the data has already been done.
For new (unseen) data, we want to assign cluster membership by selecting the maximum log-likelihood cluster, for any given set of coordinates X. To do this, I have written my own code for comparison with what was output by SPSS using the same method: https://www.norusis.com/pdf/SPC_v19.pdf
Using data that has been correctly labelled by SPSS, about 42% of the clusters are correctly labelled by minimising the RMSE to the cluster mean (which is not what SPSS does), and less than 20% of the clusters are labelled correctly by my code when assigning the maximum log-likelihood cluster (which is what SPPSS reports to do).
I know that the maximum log-likelihood cluster should be the correct cluster ( https://www.norusis.com/pdf/SPC_v19.pdf ), but there is only a 20% success rate from this code when compared to the correct cluster labels from SPSS. What am I doing wrong?
Here is the code below.
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import math
from scipy import stats
# importa raw files
clusters_df = pd.read_csv('ClusterCoordinates.csv') # clusters are in order of cluster numbers enabling us to use index for identification
clusters_df = clusters_df.drop(columns=['Cluster'])
print(clusters_df.shape)
clusters = clusters_df.to_numpy()
frames_df_raw = pd.read_csv('FrameCoordinates.csv')
frames_df = frames_df_raw.drop(columns=['frame','replica','voltage','system','ff','cluster'])
print(frames_df.shape)
frames = frames_df.to_numpy()
clusters_sd_df = pd.read_csv('ClusterCoordinates_SD.csv')
clusters_sd_df = clusters_sd_df.drop(columns=['Cluster'])
print(clusters_sd_df.shape)
clusters_sd = clusters_sd_df.to_numpy()
rmseCalc = []
llCalc = []
assignedCluster_RMSE = []
assignedCluster_LL = []
# create tables with RMSE and LL values
for frame in frames:
for cluster, cluster_sd in zip(clusters, clusters_sd):
# we compare cluster assignment using minimum RMSE vs maximum log likelihood methods.
rmseCalc.append(math.sqrt(mean_squared_error(np.array(cluster),np.array(frame))))
llCalc.append(-np.sum(stats.norm.logpdf(frame, loc=cluster, scale=cluster_sd)))
rmseCalc=np.array(rmseCalc)
llCalc=np.array(llCalc)
llCalc=np.nan_to_num(llCalc)
minRMSE = np.where(rmseCalc==rmseCalc.min())
maxLL = np.where(llCalc==llCalc.min())
print(maxLL[0][0]+1)
assignedCluster_RMSE.append(minRMSE[0][0]+1)
assignedCluster_LL.append(maxLL[0][0]+1)
rmseCalc=[]
llCalc=[]
frames_df_raw['predCluster_RMSE'] = np.array(assignedCluster_RMSE)
frames_df_raw['predCluster_LL'] = np.array(assignedCluster_LL)
frames_df_raw.to_csv('frames_clustered.csv')
I was expecting the cluster labels assigned by the code to match those already assigned by SPSS, since the methods used are intended to be the same.