I have a set of images, and for each pair of images I asked MTurk workers whether the two belong to the same category (there is more application-specific nuance, but essentially that is the question being asked).

My question is how to construct a cluster assignment from such answers, assuming all possible pairs within the set have been answered. Ideally the method would also be robust to noise (we already duplicated the questions and plan to use a majority vote).

For example, suppose there are four images A, B, C, and D, and the answers are the following: A similar to B; C similar to D; A different from C; B different from C; A different from D; B different from D.

The output should be two clusters (A, B) and (C, D). Note that we do not know the number of clusters in advance and would like to infer that from the answers.

I found some related questions on SO, but they are not exactly the same. For instance, they are based on distances instead of boolean (yes or no) answers. I might be able to reduce my problem to the distance form, but I suspect it is easier than the general distance setting. Related questions:

Clustering given pairwise distances with unknown cluster number?

https://stats.stackexchange.com/questions/2717/clustering-with-a-distance-matrix

Ideally the algorithm would already have a Python implementation (e.g., in sklearn), but if not, I don't mind implementing it myself.

Thank you.

clwen

2 Answers

Sounds like you want to use hierarchical clustering.

With average linkage, for example, each step merges the two clusters whose images were, on average, most often judged "similar".

You do need to put some thought into how to deal with missing or contradictory information. For example, you could use similarity(x,y)=(0.5+#positiveVotes)/(1+#positiveVotes+#negativeVotes) for each pair. If the pair has not been evaluated, this yields 0.5; after one positive vote it becomes 0.75, after one negative vote 0.25; and additional votes will give you a more decided similarity (unless they disagree, of course).
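A minimal sketch of how this could look with scipy, assuming the votes are stored as two symmetric count matrices. The function names, the choice of distance = 1 - similarity, the cut at distance 0.5, and the three votes per pair in the example are all assumptions made for illustration, not part of the suggestion above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def smoothed_similarity(pos, neg):
    """(0.5 + #positiveVotes) / (1 + #positiveVotes + #negativeVotes), elementwise."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    return (0.5 + pos) / (1.0 + pos + neg)


def cluster_from_votes(pos, neg, threshold=0.5):
    """Average-linkage clustering on distance = 1 - similarity.

    Cutting the tree at `threshold` (an assumed value) means the number of
    clusters is inferred rather than fixed in advance.
    """
    dist = 1.0 - smoothed_similarity(pos, neg)
    np.fill_diagonal(dist, 0.0)                  # an image is at distance 0 from itself
    condensed = squareform(dist, checks=False)   # condensed form expected by linkage
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=threshold, criterion="distance")


# The A, B, C, D example from the question, with 3 (hypothetical) votes per pair.
names = ["A", "B", "C", "D"]
pos = np.zeros((4, 4))
neg = np.full((4, 4), 3.0)
for i, j in [(0, 1), (2, 3)]:                    # A~B and C~D: all votes positive
    pos[i, j] = pos[j, i] = 3.0
    neg[i, j] = neg[j, i] = 0.0

labels = cluster_from_votes(pos, neg)
for name, label in zip(names, labels):
    print(name, label)
```

Moving the threshold trades off merging too eagerly against splitting too eagerly; 0.5 simply corresponds to "more similar than not" on average.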

Has QUIT--Anony-Mousse

One could view this as a graph problem in which the images are the nodes and the similarities between them are the edges. One could then apply a community detection algorithm (e.g. modularity maximization, or hierarchical clustering as already suggested) to categorize the images.

Both sklearn and scipy implement hierarchical clustering, and there also appears to be a Python implementation of the Louvain method for community detection.
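To make the graph view concrete, here is a dependency-free sketch of a much simpler baseline than modularity maximization: take a majority vote per pair, keep an edge wherever the majority says "same", and read off the connected components. This is not the Louvain method; the function names and the vote format (a list of `(image_a, image_b, same)` tuples) are assumptions for illustration.

```python
from collections import defaultdict


def majority_same(votes):
    """Return the pairs for which strictly more than half of the votes say 'same'."""
    tally = defaultdict(lambda: [0, 0])          # pair -> [same votes, total votes]
    for a, b, same in votes:
        pair = tuple(sorted((a, b)))             # treat (a, b) and (b, a) as one pair
        tally[pair][0] += int(same)
        tally[pair][1] += 1
    return [pair for pair, (yes, total) in tally.items() if 2 * yes > total]


def connected_components(nodes, edges):
    """Group nodes linked by 'same' edges, using a small union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]        # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


# Hypothetical noisy votes for the A, B, C, D example: three workers per pair.
votes = [
    ("A", "B", True), ("A", "B", True), ("B", "A", False),
    ("C", "D", True), ("C", "D", True), ("C", "D", True),
    ("A", "C", False), ("A", "C", False), ("A", "C", True),
    ("B", "C", False), ("B", "C", False), ("B", "C", False),
    ("A", "D", False), ("A", "D", False), ("A", "D", False),
    ("B", "D", False), ("B", "D", False), ("B", "D", False),
]
edges = majority_same(votes)
clusters = connected_components(["A", "B", "C", "D"], edges)
print(clusters)
```

One caveat with this baseline: a single wrong "same" edge that survives the majority vote merges two whole clusters, which is the kind of noise a community detection method or average linkage tolerates better.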

σηγ