
I have an array of shape (n_samples, 2) and I want to cluster its points using sklearn.neighbors.KDTree.

I have this sample piece of code:

from sklearn.neighbors import KDTree
import numpy as np
np.random.seed(0)
X = np.random.random((10, 2))
tree = KDTree(X, leaf_size=2)

Now I want to extract the points in the leaves of the tree so that each leaf can be a cluster. Points that are in the same leaf belong to the same cluster.

In the above example, because the maximum leaf_size is 2, we'll have about 10 / 2 = 5 clusters.

What I want is that, given a point in X (e.g. X[0]), the tree can give me the index of the leaf that the point belongs to.
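For illustration, here is a minimal sketch of how such a lookup might be built. It assumes that scikit-learn's KDTree keeps every sample index in some leaf node and that get_arrays() returns (data, idx_array, node_data, node_bounds), where node_data has the fields idx_start, idx_end and is_leaf. The name leaf_of_point is made up for this example, and the leaf labels are simply positions in node_data, not an official leaf identifier.

```python
import numpy as np
from sklearn.neighbors import KDTree

np.random.seed(0)
X = np.random.random((10, 2))
tree = KDTree(X, leaf_size=2)

# Assumed layout: (data, idx_array, node_data, node_bounds).
_, idx_array, node_data, _ = tree.get_arrays()

# leaf_of_point[i] is the position in node_data of the leaf holding X[i].
leaf_of_point = np.full(X.shape[0], -1, dtype=int)
for node_id, node in enumerate(node_data):
    if node["is_leaf"]:
        # Each node covers the slice idx_array[idx_start:idx_end].
        members = idx_array[node["idx_start"]:node["idx_end"]]
        leaf_of_point[members] = node_id

print(leaf_of_point[0])  # leaf label for X[0]
```

Points that share a value in leaf_of_point sit in the same leaf. Note that with this implementation a leaf may hold more than leaf_size points, so the number of groups can be smaller than n / leaf_size.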

Ash
  • The kd-tree is not well suited for clustering. – Has QUIT--Anony-Mousse Apr 01 '16 at 07:46
  • Not in my case; a KD-tree is well suited for the type of clustering I need, as noted in this paper: http://aclweb.org/anthology/P15-2104. – Ash Apr 02 '16 at 23:52
  • I would not use the term "clustering" for that. It's an adaptive grid; they don't mention what they do with non-leaf users; and it's straightforward to express and implement this without a kd-tree simply with median splitting. – Has QUIT--Anony-Mousse Apr 03 '16 at 08:00

2 Answers


A maximum leaf size of 2 means each leaf holds 1 or 2 points, so you can end up with anywhere from n/2 to n leaves. But you forgot about the non-leaf nodes.

A kd-tree will have 1 element in the root, 2 in the second layer (that are not close to each other), and then 4 leaf nodes holding the remaining 7 objects. So by looking at the leaves only, you lose three objects.

A kd-tree does not attempt to cluster points. It's perfectly valid for a kd-tree to have the exact same coordinates in two nodes! The reference you gave used the kd-tree solely to get an adaptive grid. I don't think it is a very good approach, but it is very easy. You should just implement it yourself, so you don't build the full tree, and don't put objects into non-leaf nodes.
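As a rough sketch of the "implement it yourself" idea, the following splits the points recursively at the median, alternating the split dimension, and keeps points only in the final cells. The function name median_split_labels and its leaf_size parameter are made up for this example; it is one possible reading of the median-splitting approach, not the method from the referenced paper.

```python
import numpy as np

def median_split_labels(X, leaf_size):
    """Label each row of X with the cell it lands in after recursive median splits."""
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    X = np.asarray(X, dtype=float)
    labels = np.empty(len(X), dtype=int)
    next_label = 0
    stack = [(np.arange(len(X)), 0)]  # (indices in this cell, depth)
    while stack:
        idx, depth = stack.pop()
        if len(idx) <= leaf_size:
            # Small enough: this cell becomes one group.
            labels[idx] = next_label
            next_label += 1
            continue
        axis = depth % X.shape[1]  # alternate the split dimension
        order = idx[np.argsort(X[idx, axis], kind="stable")]
        mid = len(order) // 2      # split at the median position
        stack.append((order[mid:], depth + 1))
        stack.append((order[:mid], depth + 1))
    return labels

np.random.seed(0)
X = np.random.random((10, 2))
labels = median_split_labels(X, leaf_size=2)
print(labels[0])            # cell label for X[0]
print(np.bincount(labels))  # number of points in each cell
```

Every point ends up in exactly one cell and no cell holds more than leaf_size points, so nothing is left behind in inner nodes.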

Has QUIT--Anony-Mousse

There is a Python package called kdtree, which can be installed with:

pip install --user kdtree

and can be used for clustering 2D points.

Ash