
I am trying to build a similar product using LSH and I have the following question.

My data has the following schema:

id: long,
title: string,
description: string,
category: string,
price: double,
inventory_count: int,
active: boolean,
date_added: datetime

Should I perform LSH on individual features separately and then combine them in some way, maybe a weighted average?

or

Should I build LSH on all features together (basically attaching the feature name while creating shingles, like title_iphone, title_nexus, price_1200.25, active_1, ...) and then perform LSH on this bag using a bag-of-words approach?
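For concreteness, this is roughly what I mean by the second option (a plain-Python sketch; the helper name and field handling are just illustrative, not a fixed design):

```python
def to_shingles(record):
    """Turn a structured record into feature-prefixed 'shingles'
    so all fields share one bag-of-words space for MinHash."""
    shingles = set()
    for field, value in record.items():
        if isinstance(value, str):
            # prefix each token of a text field with its field name
            for token in value.lower().split():
                shingles.add(f"{field}_{token}")
        else:
            # non-text fields become a single prefixed shingle
            shingles.add(f"{field}_{value}")
    return shingles

item = {"title": "iPhone Nexus", "price": 1200.25, "active": 1}
print(sorted(to_shingles(item)))
```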

If someone can point me to a document explaining how to perform LSH on structured data, like e-commerce data, that would be great.

P.S. I'm planning to use Spark and the MinHash function for LSH. Let me know if you need any more details.

Ankit

1 Answer


I would go with your first approach, but concatenate the binary codes obtained from each individual LSH hash instead of averaging them.

For instance, suppose you use 4 bits to represent the hash for each feature family:

data_0:
hash(id) 0101
hash(title) 1001
hash(date_added) 0001
hash(data_0) = 0101,1001,0001
weighted_average = (5+9+1)/3 = 15/3 = 5

Now suppose you have another hash for data_1:

hash(data_1) = 111100000000
weighted_average = (15+0+0)/3= 15/3 = 5

In your retrieval process, the similarity search can be performed by first computing the hash for the query data. For instance:

hash(data_x) = 010010000011
weighted_average = (4+8+3)/3 = 15/3 = 5
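A minimal sketch of how such an average-based bucket key could be computed (assuming 4-bit chunks per feature family, as in the examples above):

```python
def bucket_key(code, bits_per_feature=4):
    """Split a concatenated binary hash into fixed-width chunks and
    average their integer values (the 'weighted average' above)."""
    chunks = [code[i:i + bits_per_feature]
              for i in range(0, len(code), bits_per_feature)]
    values = [int(chunk, 2) for chunk in chunks]
    return sum(values) / len(values)

# All three example codes land in the same bucket (key 5.0)
for code in ("010110010001", "111100000000", "010010000011"):
    print(code, "->", bucket_key(code))
```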

Suppose you find that data_1 and data_0 are the only two data pieces hashed to the same bucket as data_x; then you only need to compute the Hamming distance (which can be calculated with the bitwise XOR operator) between

  • data_1 and data_x -> hamming distance = 6, similarity = 6/12
  • data_0 and data_x -> hamming distance = 3, similarity = 9/12
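The XOR-based Hamming distance above can be sketched like this (plain Python, binary strings as in the example):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary strings,
    computed via bitwise XOR on their integer values."""
    assert len(a) == len(b)
    return bin(int(a, 2) ^ int(b, 2)).count("1")

data_x = "010010000011"
print(hamming("111100000000", data_x))  # data_1 vs data_x -> 6
print(hamming("010110010001", data_x))  # data_0 vs data_x -> 3
```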

So in this example, data_0 is the most similar data to your query.

NOTE: you will lose the similarity information encoded in the individual binary codes if you average them. In the examples above, data_1 and data_0 get the same encoding, namely 5 (0101 in 4 bits). However, if you look at each individual feature, data_1 is clearly less similar to data_x than data_0 is.

ALSO NOTE: if you feel some feature family is more important and thus deserves more weight, you can use more bits for that feature family.

greeness