So assuming I have a precomputed distance matrix


    1       2       3       4       5
1   0.000   1.154   1.235   1.297   0.960   
2   1.154   0.000   0.932   0.929   0.988
3   1.235   0.932   0.000   0.727   1.244
4   1.297   0.929   0.727   0.000   1.019
5   0.960   0.988   1.244   1.019   0.000

which is actually 100,000 x 100,000 in size (the items are molecules). The distances express the similarity of the molecules, with 0 meaning essentially identical and 2 meaning completely dissimilar. My goal is to cluster the molecules into groups of similar compounds and to pick the "most representative" member of each cluster for further analysis. There are many clustering algorithms out there, and although I tried to understand them and get them to work, I failed. I know neither which one to pick nor where to find a tutorial on how to run it.

As a cheminformatics guy, the result most attractive to me would resemble the spheres (and centroids) produced by sphere-exclusion or Taylor-Butina clustering. I would be very glad about any input or hints pointing me in a direction or to helpful resources. I tried to get the SparseHC tool to run, and it does produce output, but due to a lack of documentation (or my lack of understanding of the underlying algorithms and math in the paper) the results do not help me. Many thanks in advance!

Philipp O.

1 Answer

Perhaps AgglomerativeClustering from scikit-learn could solve your problem.

data = [
[0.000,  1.154,  1.235,  1.297,  0.960],  
[1.154,  0.000,  0.932,  0.929,  0.988],
[1.235,  0.932,  0.000,  0.727,  1.244],
[1.297,  0.929,  0.727,  0.000,  1.019],
[0.960,  0.988,  1.244,  1.019,  0.000]
]

# If you have an idea about how many clusters you are expecting:
from sklearn.cluster import AgglomerativeClustering
clusterer = AgglomerativeClustering(n_clusters=3, metric="precomputed", linkage="average", distance_threshold=None)
clusters = clusterer.fit_predict(data)
print(clusters)
# prints [2 0 0 0 1]
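
Since you also want the "most representative" member of each cluster, one option is to take the medoid: the member with the smallest summed distance to the other members of its cluster. Here is a minimal sketch in plain Python. The name `cluster_medoids` is made up for illustration (it is not part of scikit-learn), and the labels are copied from the output above.

```python
def cluster_medoids(dist, labels):
    """Map each cluster label to the index of its medoid.

    Hypothetical helper: the medoid is the member whose summed distance
    to all members of the same cluster is smallest.
    """
    members = {}
    for i, label in enumerate(labels):
        members.setdefault(label, []).append(i)
    return {
        label: min(idx, key=lambda i: sum(dist[i][j] for j in idx))
        for label, idx in members.items()
    }


data = [
    [0.000, 1.154, 1.235, 1.297, 0.960],
    [1.154, 0.000, 0.932, 0.929, 0.988],
    [1.235, 0.932, 0.000, 0.727, 1.244],
    [1.297, 0.929, 0.727, 0.000, 1.019],
    [0.960, 0.988, 1.244, 1.019, 0.000],
]
labels = [2, 0, 0, 0, 1]  # cluster labels from the n_clusters=3 run

medoids = cluster_medoids(data, labels)
print(medoids)  # {2: 0, 0: 3, 1: 4}
```

Indices are 0-based, so the representative of the three-member cluster is the fourth molecule. The same function should also work if you pass the `clusters` array returned by `fit_predict` directly.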


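# If you do NOT know how many clusters to expect,
# set n_clusters=None and define a distance_threshold instead
# (exactly one of the two must be set; 0.5 is only an example value).
from sklearn.cluster import AgglomerativeClustering
clusterer = AgglomerativeClustering(n_clusters=None, metric="precomputed", linkage="average", distance_threshold=0.5)
clusters = clusterer.fit_predict(data)
print(clusters)
# prints [2 3 4 1 0]
# 0.5 is below the smallest pairwise distance (0.727), so nothing is merged
# and every molecule ends up in its own cluster.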
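
If you would rather have something close to Taylor-Butina clustering, the idea can be sketched directly on the precomputed matrix: count each molecule's neighbours within a cutoff, then repeatedly take the unassigned molecule with the most neighbours as a centroid and put its still-unassigned neighbours into its cluster. The function name `butina_clusters` and the cutoff of 0.95 are my own assumptions for illustration only. As far as I know, RDKit ships a tested implementation in `rdkit.ML.Cluster.Butina`, which would likely be the better choice for real data.

```python
def butina_clusters(dist, cutoff):
    """Sketch of Taylor-Butina clustering on a full distance matrix.

    Returns a list of clusters; the first index in each cluster is its centroid.
    """
    n = len(dist)
    # Neighbour list: every other item within the distance cutoff.
    neighbours = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff]
        for i in range(n)
    ]
    # Candidate centroids, most neighbours first (ties: lower index first).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = [False] * n
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # The centroid claims all of its neighbours that are still free.
        cluster = [centroid] + [j for j in neighbours[centroid] if not assigned[j]]
        for j in cluster:
            assigned[j] = True
        clusters.append(cluster)
    return clusters


data = [
    [0.000, 1.154, 1.235, 1.297, 0.960],
    [1.154, 0.000, 0.932, 0.929, 0.988],
    [1.235, 0.932, 0.000, 0.727, 1.244],
    [1.297, 0.929, 0.727, 0.000, 1.019],
    [0.960, 0.988, 1.244, 1.019, 0.000],
]

result = butina_clusters(data, cutoff=0.95)  # 0.95 is an arbitrary example cutoff
print(result)  # [[1, 2, 3], [0], [4]]
```

Indices are 0-based, and molecules without any neighbour within the cutoff end up as singletons. Treat this as a sketch of the logic only: a dense 100,000 x 100,000 matrix of 8-byte floats is about 80 GB, and a pure-Python loop over it will be slow.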
Vandan Revanur