
I am using scikit-learn to cluster some data, and I want to compare the results of different clustering techniques. I immediately run into the issue that the cluster labels differ from run to run, so even when the data are clustered in exactly the same way, the similarity between the two lists of labels is still very low.

Say I have

list1 = [1, 1, 0, 5, 5, 1, 8, 1]
list2 = [3, 3, 1, 2, 2, 3, 8, 3]

I would (ideally) like a function that returns the best mapping in the form of a translation dictionary like this:

findMapping(list1, list2)
>>> {0:1, 1:3, 5:2, 8:8}

And I said "best mapping" because let's say list3 = [3, 3, 1, 2, 2, 3, 8, 4] then findMapping(list1, list3) would still return the same mapping even though the final 1 turns into a 4 instead of a 3.

So the best mapping is the one that minimizes the number of differences between the two lists. I think that's a good criterion, but there may be a better one.

I could write a trial-and-error optimization algorithm to do this, but I'm hardly the first person to want to compare the results of clustering algorithms. I expect something like this already exists and I just don't know what it's called. But I searched around and didn't find any answers.

The point is that after applying the best translation I will measure the difference between the lists, so if there is a way to measure the difference between label lists that are indexed differently without finding the translation as an intermediate step, that would work too.
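
For concreteness, a naive trial-and-error version might look like the sketch below (purely illustrative: findMappingBruteForce is just a placeholder name, it assumes both clusterings use the same number of labels, and it brute-forces every permutation, so it is only feasible for a handful of clusters):

from itertools import permutations

def findMappingBruteForce(list1, list2):
    # try every one-to-one relabeling of list2's labels onto list1's labels
    # and keep the one that produces the fewest mismatches against list1
    labels1 = sorted(set(list1))
    labels2 = sorted(set(list2))
    best, bestErrors = None, len(list1) + 1
    for perm in permutations(labels1):
        mapping = dict(zip(labels2, perm))          # list2 label -> list1 label
        errors = sum(mapping[x] != t for x, t in zip(list2, list1))
        if errors < bestErrors:
            best, bestErrors = mapping, errors
    # invert so the keys are list1's labels, like the desired output above
    return dict(sorted((v, k) for k, v in best.items()))

findMappingBruteForce(list1, list2)
>>> {0: 1, 1: 3, 5: 2, 8: 8}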

===================================

Based on Pallie's answer I was able to create the findMapping function, and then I took it one step further to create a translation function that returns the second list converted to the labels of the first list.

from sklearn.metrics.cluster import contingency_matrix
import munkres

def translateLabels(masterList, listToConvert):
    # Rows/columns of the contingency matrix follow the *sorted* unique labels,
    # so the label lists below must be sorted the same way (a plain set() does
    # not guarantee that order).
    contMatrix = contingency_matrix(masterList, listToConvert)
    labelMatcher = munkres.Munkres()
    # Munkres minimizes total cost, so subtract from the max to maximize overlap.
    labelTranslator = labelMatcher.compute(contMatrix.max() - contMatrix)

    uniqueLabels1 = sorted(set(masterList))
    uniqueLabels2 = sorted(set(listToConvert))

    translatorDict = {}
    for thisPair in labelTranslator:
        translatorDict[uniqueLabels2[thisPair[1]]] = uniqueLabels1[thisPair[0]]

    return [translatorDict[label] for label in listToConvert]
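
For example, with list1 and list2 from above, the converted list reproduces list1 exactly, since the two clusterings agree up to relabeling:

translateLabels(list1, list2)
>>> [1, 1, 0, 5, 5, 1, 8, 1]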

Even with this conversion (which I needed for consistent plotting of cluster colors), the Rand index and/or normalized mutual information still seem like good ways to compare clusterings, since they don't require a shared labeling.

I also like the idea of first sorting both lists according to the values in the data, but that may not work when comparing clusters from very different data.

Aaron Bramson
  • I think the question has a certain degree of merit, even if someone has purportedly flagged it as a duplicate. I'm upvoting it. – mnm Mar 21 '19 at 12:36

1 Answer


You could try calculating the adjusted Rand index between two results. This gives a score between -1 and 1, where 1 is a perfect match.
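
For example (a quick check with sklearn.metrics.adjusted_rand_score on the two lists from the question, which agree perfectly up to relabeling):

from sklearn.metrics import adjusted_rand_score

adjusted_rand_score([1, 1, 0, 5, 5, 1, 8, 1], [3, 3, 1, 2, 2, 3, 8, 3])
1.0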

Or by taking the argmax of the contingency matrix:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

list1 = ['a', 'a', 'b', 'c', 'c', 'a', 'd', 'a']
list2 = [3, 3, 1, 2, 2, 3, 8, 3]
np.argmax(contingency_matrix(list1, list2), axis=1)
array([2, 0, 1, 3])

The first entry, 2, means that column 2 of the contingency matrix (the list2 cluster labeled 3) best matches row 0 ("a" in list1). The second entry, 0, means column 0 (label 1) best matches row 1 ("b"), and so on.
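
If you want an explicit translation dictionary like the one requested in the question, the argmax output can be combined with the sorted unique labels that index the rows and columns of the matrix. A minimal sketch (rows and cols are just helper names introduced here):

rows = np.unique(list1)   # row labels of the contingency matrix: 'a', 'b', 'c', 'd'
cols = np.unique(list2)   # column labels: 1, 2, 3, 8
best = np.argmax(contingency_matrix(list1, list2), axis=1)
{str(r): int(cols[c]) for r, c in zip(rows, best)}
{'a': 3, 'b': 1, 'c': 2, 'd': 8}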


For the Hungarian method:

from munkres import Munkres

m = Munkres()
contmat = contingency_matrix(list1, list2)
m.compute(contmat.max() - contmat)
[(0, 2), (1, 0), (2, 1), (3, 3)]

using: https://github.com/bmc/munkres
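
The (row, column) pairs can be turned into the same kind of translation dictionary, reusing the rows and cols arrays from the argmax sketch above (again only a sketch, assuming both clusterings have the same number of labels, since Munkres returns a square assignment):

{str(rows[r]): int(cols[c]) for r, c in m.compute(contmat.max() - contmat)}
{'a': 3, 'b': 1, 'c': 2, 'd': 8}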

Pallie
  • Yes, this is the kind of thing that I expected existed. It does satisfy my need to measure the distance between lists without finding the translation. If possible I'd still like to have the "translation map" so I can identify which items are different and produce a global color scheme for plots. – Aaron Bramson Mar 20 '19 at 11:32
  • For that information try the contingency matrix: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html – Pallie Mar 20 '19 at 11:46
  • I believe you that the contingency matrix can provide the information necessary to convert one set of labels to another, but with just that documentation and no examples to be found, it's totally unclear how that can be done. – Aaron Bramson Mar 21 '19 at 11:38
  • Taking the np.argmax of the contingency matrix should result in a list where the indices are the keys and the values are the values of a good mapping. – Pallie Mar 21 '19 at 12:52
  • Well, for my sample data, `contingency_matrix` outputs `[[1 0 0 0] [0 0 4 0] [0 2 0 0] [0 0 0 1]]` and applying `argmax` yields `[0 2 1 3]`, but I don't see how to translate those two lists into the same indices from that output. I mean, what should `5` translate to? What about 8? Sorry, I'm just not seeing how to get this to do the job. – Aaron Bramson Mar 21 '19 at 14:21
  • I think it's worth specifying that the Munkres algorithm might not be a good idea if the two partitions have different cardinalities (e.g. one ground-truth cluster is split in two by a given clustering procedure). In this case, the Munkres algorithm will leave some clusters unmatched, and the argmax technique (albeit suboptimal) is the way to go if one needs a complete assignment map. – AlessioX Mar 19 '20 at 07:30