Note: Using LSH for a Nearest Neighbor Query

Assume the data set has 5 features (f1, f2, ..., f5), where the first 2 are numerical and the other 3 are categorical. One or more of the categorical features may be something like a username or subject, which would have too many distinct values to encode.

If I use mixed Euclidean distance as the distance measure and use it in the hash function, how should I select the random projections for the function?

It's OK if I have to change the hash function.

Sample data

f1,f2,f3,f4,f5
89,43,aa,bq,wb
23,67,cd,zd,cs
98,32,aa,wb,cc
10,20,aq,zd,wb
Vishnu667

1 Answer

You can try converting the categorical features into dummy features. You can check the following options:

  • Encoding, like this
  • If you have dataframes, this is straightforward
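
As a rough sketch of the dummy-feature idea, here is a plain-Python version applied to the sample rows from the question. The helper name `one_hot_rows` and the `fN=value` column naming are hypothetical, chosen only for illustration; a dataframe library would do the same thing in one call.

```python
# Sample rows from the question: f1, f2 are numerical; f3, f4, f5 are categorical.
rows = [
    (89, 43, "aa", "bq", "wb"),
    (23, 67, "cd", "zd", "cs"),
    (98, 32, "aa", "wb", "cc"),
    (10, 20, "aq", "zd", "wb"),
]

def one_hot_rows(rows, num_numeric=2):
    """Return (column_names, encoded_rows) with one 0/1 column per category value.

    Hypothetical helper: assumes the first `num_numeric` columns are numerical
    and every remaining column is categorical.
    """
    n_cols = len(rows[0])
    # Collect the sorted distinct values seen in each categorical column.
    vocab = {
        c: sorted({row[c] for row in rows})
        for c in range(num_numeric, n_cols)
    }
    # Numerical columns keep their names; each category value gets its own column.
    names = [f"f{c + 1}" for c in range(num_numeric)]
    for c, values in vocab.items():
        names += [f"f{c + 1}={v}" for v in values]
    encoded = []
    for row in rows:
        vec = [float(x) for x in row[:num_numeric]]
        for c, values in vocab.items():
            vec += [1.0 if row[c] == v else 0.0 for v in values]
        encoded.append(vec)
    return names, encoded

names, encoded = one_hot_rows(rows)
for name_row in (names, *encoded):
    print(name_row)
```

Note that the encoded width grows with the number of distinct values in each categorical column, so a column such as username would add one dimension per user.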

Hope it helps.

Aramis7d
  • Encoding them might work when the data set has fewer categories. If that were the case, I would have stuck with ANN for hashing. Since the number of categories is very high, I'm considering LSH for the nearest neighbor query. – Vishnu667 Feb 29 '16 at 08:03
  • Maybe elaborate on the categories/data? – Aramis7d Feb 29 '16 at 09:40
  • Let's say one or more of them might be something like username or account branch, or anything else that would be too large to just encode. – Vishnu667 Mar 03 '16 at 18:12