
I have a set of reviews that I clustered with k-means, so each review now has a cluster number (e.g. 1, 2, 3, ...). I also have the true label of each review (e.g. location, food, etc.), and I need to compare the two with the Rand index.

Since I have cluster numbers on one side and text labels on the other, how can I apply the Rand index to compare them?

Is there any intermediate step that I should follow?

Edit: I've seen the post Rand Index function (clustering performance evaluation) but it does not answer my question.

In that question, you have

labels_true = [1, 1, 0, 0, 0, 0]
labels_pred = [0, 0, 0, 1, 0, 1]

but what I have is something like below,

labels_true = ['food', 'view', 'room', 'food', 'staff', 'staff']
labels_pred = [0, 0, 0, 1, 0, 1]

Any help is highly appreciated.

lse23

1 Answer


Just use the sklearn.metrics.rand_score function:

from sklearn.metrics import rand_score

rand_score(labels_true, labels_pred)

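It doesn't matter if the true labels and the predicted labels take values in different domains: the Rand index only looks at whether each pair of samples is grouped together or apart in the two labelings, never at the label values themselves. Have a look at these examples: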

>>> rand_score(['a', 'b', 'c'], [5, 6, 7])
1.0
>>> rand_score([0, 1, 2], [5, 6, 7])
1.0
>>> rand_score(['a', 'a', 'b'], [0, 1, 2])
0.6666666666666666
>>> rand_score(['a', 'a', 'b'], [7, 7, 2])
1.0
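
To see why the label values are irrelevant, here is a minimal pure-Python sketch of the pair-counting definition of the Rand index, applied to the lists from the question. The helper name `rand_index` is made up for illustration and is not part of scikit-learn; returning 1.0 when there are fewer than two samples is an assumption of this sketch.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree.

    Illustrative helper, not the scikit-learn implementation.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair agrees if it is together in both labelings
        # or apart in both labelings.
        if same_true == same_pred:
            agree += 1
        total += 1

    # Assumption of this sketch: with no pairs to compare, report 1.0.
    return agree / total if total else 1.0


labels_true = ['food', 'view', 'room', 'food', 'staff', 'staff']
labels_pred = [0, 0, 0, 1, 0, 1]

print(rand_index(labels_true, labels_pred))  # 0.4
```

Labels are only ever compared for equality within the same list, so strings on one side and integers on the other need no mapping step. For real data, `rand_score` is the better choice, since looping over all pairs is quadratic in the number of samples.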
Riccardo Bucco
  • It seems like Jaccard similarity cannot be applied when the true values and predicted values are in different domains. @Riccardo Bucco do you have an idea of how to handle this scenario? – lse23 Nov 25 '21 at 19:22
  • @lse23 please open another question :) – Riccardo Bucco Nov 25 '21 at 20:53