I am using scikit-learn to cluster some data, and I want to compare the results of different clustering techniques. I am immediately faced with the issue that the labels for the clusters are different for different runs, so even if they are clustered exactly the same the similarity of the lists is still very low.
Say I have
list1 = [1, 1, 0, 5, 5, 1, 8, 1]
list2 = [3, 3, 1, 2, 2, 3, 8, 3]
I would (ideally) like a function that returns the best mapping in the form of a translation dictionary like this:
findMapping(list1, list2)
>>> {0:1, 1:3, 5:2, 8:8}
And I said "best mapping" because let's say list3 = [3, 3, 1, 2, 2, 3, 8, 4]
then findMapping(list1, list3)
would still return the same mapping even though the final 1
turns into a 4
instead of a 3
.
So the best mapping is the one that minimizes the number of differences between the two lists. I think's a good criterion, but there may be a better one.
I could write a trial-and-error optimization algorithm to do this, but I'm hardly the first person to want to compare the results of clustering algorithms. I expect something like this already exists and I just don't know what it's called. But I searched around and didn't find any answers.
The point is that after applying the best translation I will measure the difference between the lists, so maybe there is a way to measure the difference between lists of numbers indexed differently without finding the translation as an intermediate step, and that's good too.
===================================
Based on Pallie's answer I was able to create the findMapping function, and then I took it one step further to create a translation function that returns the second list converted to the labels of the first list.
def translateLabels(masterList, listToConvert):
contMatrix = contingency_matrix(masterList, listToConvert)
labelMatcher = munkres.Munkres()
labelTranlater = labelMatcher.compute(contMatrix.max() - contMatrix)
uniqueLabels1 = list(set(masterList))
uniqueLabels2 = list(set(listToConvert))
tranlatorDict = {}
for thisPair in labelTranlater:
tranlatorDict[uniqueLabels2[thisPair[1]]] = uniqueLabels1[thisPair[0]]
return [tranlatorDict[label] for label in listToConvert]
Even with this conversion (which I needed for consistent plotting of cluster colors), using the Rand index and/or normalized mutual information does seem like a good way to compare the differences that don't require a shared labeling.
I also like the idea of first sorting both lists according the values in the data, but that may not work when comparing clusters from very different data.