Clustering given pairwise distances with unknown cluster number?

Question

I have a set of objects {obj1, obj2, obj3, ..., objn}. I have calculated the pairwise distances of all possible pairs. The distances are stored in a n*n matrix M, with Mij being the distance between obji and objj. Then it is natural to see M is a symmetric matrix.

Now I wish to perform unsupervised clustering to these objects. After some searching, I find Spectral Clustering may be a good candidate, since it deals with such pairwise-distance cases.

However, after carefully reading its description, I find it unsuitable in my case, as it requires the number of clusters as the input. Before clustering, I don't know the number of clusters. It has to be figured out by the algorithm while performing the clustering, like DBSCAN.

Considering these, please suggest me some clustering methods that fit my case, where

The pairwise distances are all available.
The number of clusters is unknown.

What is wrong with DBSCAN? It does **not** need to know the number of clusters... — Has QUIT--Anony-Mousse, Sep 21 '13 at 17:38
@Anony-Mousse I am not sure whether it is suitable for the "relative distances" — Sibbs Gambling, Sep 22 '13 at 01:36
you never mentioned "relative distances". And what is that, anyway? Distances are always relative values... And in fact, DBSCAN only needs a binary "near" decision, see generalized DBSCAN. — Has QUIT--Anony-Mousse, Sep 22 '13 at 07:55
@Anony-Mousse sorry! typo. Should be "pairwise distances" instead — Sibbs Gambling, Sep 22 '13 at 08:13
Well, DBSCAN clearly can handle that. DBSCAN is distance based, not coordinates-based (k-means needs coordinates, to compute a mean - but DBSCAN does not have this limitation). So you really should **try DBSCAN**. — Has QUIT--Anony-Mousse, Sep 22 '13 at 13:29
@Anony-Mousse Cool! I really appreciate your help! I will go and try it out! — Sibbs Gambling, Sep 22 '13 at 14:16

score 7 · Accepted Answer · edited Sep 21 '13 at 17:34

7

There are many possible clustering methods, and none of them can be considered "best", everything depends on the data, as always:

If you would like to use spectral clustering, but do not know the number of clusters before hand I suggest taking a look at the self-tuning spectral clustering or some methods of determining the number of clusters
If you consider other algorithms you could try:

edited Sep 21 '13 at 17:34

Has QUIT--Anony-Mousse

76,138
12
138
194

answered Sep 20 '13 at 10:15

lejlot

64,777
8
131
164

Note that it's DBSCAN, not DBScan. It's an abbreviation, N is for Noise, for example. It's not really "scanning" a database, but it works best with index structures to accelerate search. +1 for listing the most popular ones. – Has QUIT--Anony-Mousse Sep 21 '13 at 17:37

score 3 · Answer 2 · answered Sep 22 '13 at 15:18

3

You can try multidimensional scaling (MDS). After you use MDS to convert the distance-like data into a geometrical picture, you can apply common clustering methods (like k-means) for clustering. See here and here for more.

answered Sep 22 '13 at 15:18

chaohuang

3,965
4
27
35

score 2 · Answer 3 · edited Apr 13 '17 at 12:44

2

Clustering methods that require the number of clusters a priori are much more common than those that try to estimate the number of clusters. You might get better answers at Cross Validated. In the meantime, however, a couple of recent approaches to the problem are:

Estimating the number of clusters in a data set via the gap statistic by Tibshirani, Walther and Hastie, which compares the change in within-cluster dispersion with the number of clusters against the expected change for an appropriate reference null distribution. There is an R implementation of this approach.
Cluster Validation by Prediction Strength by Tibshirani and Walther, which views "clustering as a supervised classification problem, in which we must also estimate the 'true' class labels. The resulting 'prediction strength' measure assesses how many groups can be predicted from the data, and how well."

edited Apr 13 '17 at 12:44

Community

1
1

answered Sep 20 '13 at 05:43

Simon

10,679
1
30
44

Thanks for the answer! Maybe I failed to make the question very clear, actually the "pairwise distance" is of more importance in my case. That is, we can first ignore the unknown-class-number issue. Any other comments on the "pairwise distance"? – Sibbs Gambling Sep 20 '13 at 14:22
Pairwise distance is a typical measure of the dissimilarity between the items. Some measure of the dissimilarity between each pair of items is required as input to every clustering algorithm that I've used but there are other dissimilarity measures that are reasonable in some cases, e.g. the square of the distance between each pair. – Simon Sep 20 '13 at 23:31

score 2 · Answer 4 · answered Sep 20 '13 at 12:29

Another approach that nobody has suggested thus far, if you like probabilistic clustering, is Bayesian non-parametrics (Dirichlet process priors being the simplest case). You can use multinomial likelihood for count-type data, or multivariate Gaussian likelihood if your data are continuous.

score 2 · Answer 5 · answered Jan 17 '18 at 19:16

It's easy to do with the metric='precomputed' argument in sklearn clustering algorithms. You fit the model with the pairwise distance matrix rather than original features.

The idea how to do this is the following (for the case when you need to create a pairwise distance matrix too):

def my_metric(x, y):
   # implement your distance measure between x and y

def create_pairwise_dist(X_data):
   # create a matrix of pairwised distances between all elements in your X_data
   # for example with sklearn.metrics.pairwise.pairwise_distances
   # or scipy.spatial.distance.pdist
   # or your own code

X_data = <prepare your data matrix of features>
X_dist = create_pairwise_dist(X_data)

# then you can use DBSCAN

dbscan = DBSCAN(eps=1.3, metric='precomputed')
dbscan.fit(X_dist)

Alexander Kuznetsov · Answer 6 · 2013-09-20T09:07:35.003

1

You can try to use hierarchical clustering. It has two types:

Agglomerative or "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive or "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

edited Sep 20 '13 at 09:07

answered Sep 20 '13 at 05:35

Alexander Kuznetsov

3,062
2
25
29

score 1 · Answer 7 · answered Oct 22 '13 at 09:30

Have you considered Correlation Clustering?
If you read carefully section 2.1 in that paper you'll see a probabilistic interpretation to the recovered number of clusters.

The only modification you need for your M matrix is to set a threshold deciding what distance is considered "same" and what distance is too large and should be considered as "not-same".

Section 7.2 in the aforementioned paper deals with a clustering of a full matrix where the recovering of the underlying number of clusters is an important part of the task at hand.

Clustering given pairwise distances with unknown cluster number?

7 Answers7

Linked