Technique For Comparing Items in a Set with Varying Numbers of Attributes Possibly Using LSH

Asked Feb 07 '19 at 18:36

Active Feb 08 '19 at 03:05

Viewed 33 times

I have a data set containing millions of items collected from many disparate sources. Each item contains a list of anywhere from fifty to a thousand attributes. The specific attributes available vary greatly from item to item.

I am looking for the best way to find the items in the set that are most similar to a given target item. (I obviously want to accomplish this without doing a brute-force comparison against every item in the set.)

I would like to use something like Locality Sensitive Hashing with MinHash. However, MinHash estimates Jaccard similarity (intersection size divided by union size), so if the target item has 50 attributes and a likely matching item within the larger data set has 200, MinHash will score them as dissimilar (at most 50/200 = 0.25) even if the item with 200 attributes contains all of the attributes of the target item.
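To make the size mismatch concrete, here is a minimal sketch of the 50-versus-200 case. The attribute names are made up for illustration, and `containment` is an assumption on my part: it is one asymmetric measure that ignores the extra attributes of the larger item, not something the question itself proposes.

```python
def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (the quantity MinHash estimates)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(query, candidate):
    """Fraction of the query's attributes that also appear in the candidate.

    Hypothetical helper for illustration; asymmetric by design.
    """
    query, candidate = set(query), set(candidate)
    return len(query & candidate) / len(query) if query else 1.0


# Hypothetical data: the target's 50 attributes are a subset of the candidate's 200.
target = {f"attr_{i}" for i in range(50)}
candidate = {f"attr_{i}" for i in range(200)}

print(jaccard(target, candidate))      # 0.25
print(containment(target, candidate))  # 1.0
```

The Jaccard score is capped by the ratio of the set sizes, while the containment score depends only on how much of the target is covered.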

What are the best techniques or algorithms to use to compare items with dissimilar numbers of attributes?

edited Feb 08 '19 at 03:05

Anthony Gatlin

Regardless of efficiency, do you have any distance metric in mind that would capture the similarity of two items? – Ameer Jewdaki Feb 08 '19 at 00:10

0 Answers