
I am trying to do approximate KNN with BucketedRandomProjectionLSH in Spark 2.2.0, and I am wondering how I should set the bucket length. The record count and the number of features vary between runs, so I think it is better to derive the length from some condition rather than hard-code it. How should I set the bucket length for good performance? I have rescaled all the features in the vector to the range 0 to 1.

Also, is there any way to guarantee that the algorithm returns a minimum number of elements? I found that sometimes the number of elements inside the bucket is smaller than the queried k, and I want at least one or two neighbors as a result.
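
For reference, here is a minimal sketch of my setup (the DataFrame and column names are illustrative, and the `bucketLength` is just a placeholder):

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

# df is a DataFrame with a "features" vector column already rescaled to [0, 1]
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df)

# Query the k approximate nearest neighbors of a point;
# this sometimes returns fewer than k rows.
key = Vectors.dense([0.5, 0.5, 0.5])
neighbors = model.approxNearestNeighbors(df, key, numNearestNeighbors=5)
```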

Thanks~

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH

Yong Hyun Kwon

1 Answer


According to the Scaladoc for BucketedRandomProjectionLSH:

If input vectors are normalized, 1-10 times of pow(numRecords, -1/inputDim) would be a reasonable value
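
As an illustration (not part of the quoted docs), the heuristic can be computed directly from the data; the DataFrame and column names below are assumptions, and the factor 2 is one arbitrary pick from the suggested 1-10 range:

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH

# df is assumed to hold normalized vectors in a "features" column
num_records = df.count()
input_dim = len(df.first()["features"])   # dimensionality of the feature vectors

# Scaladoc heuristic: 1-10 times pow(numRecords, -1/inputDim)
bucket_length = 2 * pow(num_records, -1.0 / input_dim)

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=bucket_length, numHashTables=3)
model = brp.fit(df)
```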

pauli
  • For those who don't see the quoted sentence above, try expanding `Parameters -> val bucketLength` in the Scaladoc. – cinqS Dec 10 '18 at 09:58