How to take random projections in LSH when there are both Numerical and Categorical Data?

Question

Note: I am using LSH for a nearest-neighbor query.

Assume the data set has 5 features (f1, f2, ..., f5), where the first 2 are numerical and the other 3 are categorical. One or more of the categorical features may be something like a username or subject, which has too many distinct values to encode comfortably.

If I use Mixed Euclidean distance as the distance measure and use it in the hash function, how should I select the random projections for that function?

It's OK if I have to change the hash function.

Sample data

f1,f2,f3,f4,f5
89,43,aa,bq,wb
23,67,cd,zd,cs
98,32,aa,wb,cc
10,20,aq,zd,wb

score 0 · Answer 1 · answered Jun 24 '15 at 09:25

You can try converting the categorical features into dummy features. You can check the following options:

Encoding, like this
If you have dataframes, this is straightforward
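
As a rough illustration of the dummy-feature idea, here is a minimal sketch in plain Python using the sample rows from the question. The helper names (`encode`, `make_hyperplanes`, `signature`) are made up for this example and do not come from any library. It also assumes a sign-of-random-projection hash, which approximates cosine similarity rather than the mixed Euclidean distance mentioned in the question, so treat it as one possible setup and not a definitive answer.

```python
import random

# Sample rows from the question: f1, f2 are numeric; f3, f4, f5 are categorical.
rows = [
    (89, 43, "aa", "bq", "wb"),
    (23, 67, "cd", "zd", "cs"),
    (98, 32, "aa", "wb", "cc"),
    (10, 20, "aq", "zd", "wb"),
]
NUM_COLS = (0, 1)
CAT_COLS = (2, 3, 4)

# One dummy (0/1) column per (column, category) pair seen in the data.
vocab = sorted({(c, row[c]) for row in rows for c in CAT_COLS})
dummy_index = {key: i for i, key in enumerate(vocab)}

def encode(row):
    """Return the numeric values followed by the one-hot dummy columns."""
    vec = [float(row[c]) for c in NUM_COLS] + [0.0] * len(vocab)
    for c in CAT_COLS:
        key = (c, row[c])
        if key in dummy_index:  # categories never seen before stay all-zero
            vec[len(NUM_COLS) + dummy_index[key]] = 1.0
    return vec

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: 1 if the projection is non-negative, else 0."""
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0) for plane in planes
    )

vectors = [encode(r) for r in rows]
planes = make_hyperplanes(len(vectors[0]), n_bits=8)
signatures = [signature(v, planes) for v in vectors]
```

Rows with equal signatures would land in the same bucket. Note that the numeric columns are left unscaled here, so they dominate the projections; in practice you would probably rescale them first. The encoded vector also grows by one column per distinct category value, which is why this gets unwieldy for something like usernames.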

Hope it helps.

Aramis7d

Encoding them might work when the data set has fewer categories; if that were the case, I would have stuck with ANN for hashing. Since the number of categories is very high, I'm considering LSH for the nearest-neighbor query. – Vishnu667 Feb 29 '16 at 08:03
Maybe elaborate on the categories/data? – Aramis7d Feb 29 '16 at 09:40
Let's say one or more of them might be something like a username or account branch, or anything else with too many distinct values to simply encode. – Vishnu667 Mar 03 '16 at 18:12