
I have arrays of latitude and longitude data points on which I want to do hierarchical clustering. Here is my code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.asarray(list(zip(longitude, latitude)))

knn_graph = kneighbors_graph(X, 30, include_self=False, metric=haversine)

for connectivity in (None, knn_graph):
    for n_clusters in (5, 8, 10, 15, 20):
        plt.figure(figsize=(4, 5))
        for linkage in ('average', 'complete', 'ward'):
            model = AgglomerativeClustering(linkage=linkage,
                                            connectivity=connectivity,
                                            n_clusters=n_clusters)
            model.fit(X)
            plt.scatter(X[:, 0], X[:, 1], c=model.labels_,
                        cmap=plt.cm.spectral)
            plt.title('linkage=%s (n_clusters=%s)' % (linkage, n_clusters),
                      fontdict=dict(verticalalignment='top'))
            plt.axis([37.1, 37.9, -122.6, -121.6])
plt.show()

The problem is that kneighbors_graph has a parameter called `metric` which defines how the distance is computed (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html). I want to define my own metric (the real distance on the earth's surface, based on longitude, latitude and the earth radius), but it seems I cannot plug in my own function. Any ideas?
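
For reference, the kind of function I would like to plug in is a great-circle (haversine) distance, roughly like the sketch below (the (longitude, latitude) ordering and the 6371 km earth radius are my own choices):

import numpy as np

def haversine(p1, p2):
    # great-circle distance in km between two (longitude, latitude) points given in degrees
    lon1, lat1 = np.radians(p1)
    lon2, lat2 = np.radians(p2)
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * 6371.0 * np.arcsin(np.sqrt(a))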

  • And `affinity="haversine"` does not work? Then use a precomputed distance matrix, or ELKI. – Has QUIT--Anony-Mousse Nov 17 '16 at 07:48
  • @Anony-Mousse, hmmm, it works now, I had just forgotten the quotes (stupid of me). By the way, in the knn_graph the distance is defined by "haversine"; when doing the agglomerative clustering, is the distance being minimized (the "average", "complete" or "ward" distance) also the haversine distance, or not? – printemp Nov 17 '16 at 18:03
  • You need to pass the `affinity` to the clustering. See the documentation. – Has QUIT--Anony-Mousse Nov 18 '16 at 07:15

1 Answer


Note that

  • the distance function is usually expected as a string (e.g. "haversine")

  • you use a distance in two places: once for the knn graph and once as the affinity for the clustering.

  • hierarchical clustering involves two kinds of distances, and thus two distance parameters. One is the distance between objects (e.g. haversine); the other is the distance between clusters, which is usually derived from the object distance by aggregation (e.g. maximum, minimum). Both are often called "distance". In sklearn, the first one is called affinity. See the sketch below for how the two fit together.
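
A minimal sketch of how the two roles might be wired up, using a precomputed haversine matrix for the object distance (as suggested in the comments); the `latitude`/`longitude` arrays and the 6371 km earth radius are placeholders:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import haversine_distances
from sklearn.neighbors import kneighbors_graph

# sklearn's built-in "haversine" metric expects [latitude, longitude] in radians
coords = np.radians(np.column_stack([latitude, longitude]))

# object distance, used once to build the connectivity (knn) graph ...
knn_graph = kneighbors_graph(coords, 30, include_self=False, metric="haversine")

# ... and once as the affinity of the clustering, here as a precomputed matrix in km
D = haversine_distances(coords) * 6371.0

# cluster distance = linkage; "ward" only works with a euclidean affinity,
# so use "average", "complete" or "single" together with haversine
model = AgglomerativeClustering(n_clusters=10,
                                affinity="precomputed",
                                linkage="average",
                                connectivity=knn_graph)
labels = model.fit_predict(D)

Note that in recent scikit-learn versions the `affinity` parameter has been renamed to `metric`.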

Has QUIT--Anony-Mousse