The library and the corresponding documentation are the following -- yes, I read everything and was able to run it on my own code.

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html

But the results do not really make sense to me, so I went through the example (which is also included on the page above):

    >>> from sklearn.neighbors import LSHForest
    >>> X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
    >>> X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]
    >>> lshf = LSHForest()
    >>> lshf.fit(X_train)  
    LSHForest(min_hash_match=4, n_candidates=50, n_estimators=10,
              n_neighbors=5, radius=1.0, radius_cutoff_ratio=0.9,
              random_state=None)
    >>> distances, indices = lshf.kneighbors(X_test, n_neighbors=2)
    >>> distances
    array([[ 0.069...,  0.149...],
           [ 0.229...,  0.481...],
           [ 0.004...,  0.014...]])
    >>> indices
    array([[1, 2],
           [2, 0],
           [4, 0]])

So I tried to verify the example by finding the nearest neighbors of the three test points [9, 1, 6], [3, 1, 10], and [7, 10, 3].

Say we search for the nearest neighbors of [9, 1, 6] using Euclidean distance: the closest training points are [5, 5, 2] and [1, 1, 1], i.e. indices [0, 2] -- significantly different from the reported result [1, 2].

The distances are also completely off compared with a simple hand calculation (my Excel sheet is attached).
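As a quick sanity check (a NumPy sketch standing in for the attached sheet), the Euclidean distances from [9, 1, 6] to every training point come out as:

    import numpy as np

    # Plain Euclidean distances from the first query point to all five
    # training points.
    X_train = np.array([[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]])
    query = np.array([9, 1, 6])

    dists = np.linalg.norm(X_train - query, axis=1)
    print(np.round(dists, 2))   # [ 6.93  12.69   9.43   9.49  10.3 ]
    print(np.argsort(dists))    # [0 2 3 4 1] -> the two nearest are 0 and 2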

Thanks again for your time and help.

user381509
  • In cosine distance, which is what LSHForest supports, those results are exact nearest neighbors. – joeln Jun 25 '15 at 06:43

1 Answer

It's not wrong: LSHForest implements ANN (approximate nearest neighbors), and that is the difference we need to take into consideration. ANN results are not the exact nearest neighbors, but an approximation of what the nearest neighbors should be.

For comparison, an exact 2-nearest-neighbor search looks like:

from sklearn.neighbors import NearestNeighbors

X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]

# Exact search; the default metric is Minkowski with p=2, i.e. Euclidean.
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X_train)
distances, indices = nbrs.kneighbors(X_test)

and returns

indices
Out[2]: 
array([[0, 2],
       [0, 2],
       [4, 3]], dtype=int64)

distances
Out[3]: 
array([[ 6.92820323,  9.43398113],
       [ 9.16515139,  9.21954446],
       [ 1.41421356,  2.44948974]])

If it helps, check out this and notice that it mentions:

given a query point q, if there exists a point within distance r from q, then it reports a point within distance cr from q. Here c is the approximation factor of the algorithm.

The point at distance r and the point actually returned don't have to be the same. For example, with approximation factor c = 2 and a true nearest neighbor at distance r = 1, the algorithm is allowed to return any point within distance 2 of the query.

Hope this helps.

Aramis7d
  • thanks for the explanation, but why is the "distance" output of LSHForest about two orders of magnitude smaller than the actual Euclidean distance? Not sure how the distance is calculated for LSHForest. :) – user381509 Jun 24 '15 at 10:36
  • Currently the LSHForest implementation in scikit-learn only provides approximation on cosine distance. So what you get is not Euclidean distance, but cosine distance. You can use NearestNeighbors to get the Euclidean distances, where you can choose what distance metric you want (see the sketch below). – maheshakya Jun 24 '15 at 12:53
  • The approximateness is irrelevant - it's the wrong metric! – joeln Jun 25 '15 at 06:44
  • @joeln, the metric may not be suitable to your purposes, in which case it'll be wrong to use this method, but how can you say "it's the wrong metric" without any context? If you're referring to it not satisfying the triangle inequality, then it states clearly in the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html) that the distance value depends only on the relative angles of the vector points. – Aramis7d Jun 25 '15 at 10:16
  • No, I mean it's the wrong metric in that one can't expect results true of Euclidean distance when the algorithm implements cosine. – joeln Jun 27 '15 at 10:22
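Following up on maheshakya's comment above, here is a minimal verification sketch (not from the original answer; it assumes scikit-learn's sklearn.metrics.pairwise.cosine_distances helper). Recomputing the queries under cosine distance reproduces the LSHForest output exactly:

    from sklearn.metrics.pairwise import cosine_distances
    import numpy as np

    X_train = [[5, 5, 2], [21, 5, 5], [1, 1, 1], [8, 9, 1], [6, 10, 2]]
    X_test = [[9, 1, 6], [3, 1, 10], [7, 10, 3]]

    # Pairwise cosine distances: one row per query point, one column per
    # training point.
    D = cosine_distances(X_test, X_train)

    indices = np.argsort(D, axis=1)[:, :2]  # two closest per query
    distances = np.sort(D, axis=1)[:, :2]

    print(indices)    # [[1 2], [2 0], [4 0]] -- matches the LSHForest indices
    print(distances)  # [[0.069 0.149], [0.229 0.481], [0.004 0.014]]

So under cosine distance the LSHForest results are in fact the exact nearest neighbors, as joeln's first comment says.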