
I am trying to do approximate KNN with BucketedRandomProjectionLSH in Spark 2.2.0, and I am wondering how I should set the bucket length. The record count and the number of features vary between runs, so I think it is better to derive the length from some condition rather than hard-code it. How should I set the bucket length for good performance? I have rescaled all the features in the vector to the range 0 to 1.

Also, is there any way to guarantee that the algorithm returns a minimum number of elements? I found that sometimes the number of elements inside the bucket is smaller than the queried k, and I want at least one or two neighbors as a result.
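
For reference, here is a minimal sketch of my setup (the DataFrame and column names are illustrative, and the `bucketLength` is just a placeholder):

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

# df is a DataFrame with a "features" vector column already rescaled to [0, 1]
brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df)

# Query the k approximate nearest neighbors of a point;
# this sometimes returns fewer than k rows.
key = Vectors.dense([0.5, 0.5, 0.5])
neighbors = model.approxNearestNeighbors(df, key, numNearestNeighbors=5)
```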

Thanks~

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH

Yong Hyun Kwon

1 Answer


According to the Scaladoc for BucketedRandomProjectionLSH:

If input vectors are normalized, 1-10 times of pow(numRecords, -1/inputDim) would be a reasonable value
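
As an illustration (not part of the quoted docs), the heuristic can be computed directly from the data; the DataFrame and column names below are assumptions, and the factor 2 is one arbitrary pick from the suggested 1-10 range:

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH

# df is assumed to hold normalized vectors in a "features" column
num_records = df.count()
input_dim = len(df.first()["features"])   # dimensionality of the feature vectors

# Scaladoc heuristic: 1-10 times pow(numRecords, -1/inputDim)
bucket_length = 2 * pow(num_records, -1.0 / input_dim)

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=bucket_length, numHashTables=3)
model = brp.fit(df)
```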

pauli
  • For those who don't see the quoted sentence above, try expanding `Parameters -> val bucketLength` in the Scaladoc. – cinqS Dec 10 '18 at 09:58