I have a data set containing millions of items collected from many disparate sources. Each item contains a list of anywhere from fifty to a thousand attributes. The specific attributes available vary greatly from item to item.

I am looking for the best way to find the items in the set that are most similar to a given target item. (I obviously want to accomplish this without doing a brute-force comparison against every item in the set.)

I would like to use something like Locality Sensitive Hashing with MinHash. However, MinHash estimates Jaccard similarity, so if the target item has 50 attributes and a likely matching item within the larger data set has 200, their similarity can be at most 50/200 = 0.25. MinHash will therefore treat them as dissimilar even if the item with 200 attributes contains all of the attributes of the target item.
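
To make the problem concrete, here is a minimal sketch of the 50-versus-200 case. Everything in it is an assumption made for illustration: the attribute names are made up, `minhash_signature` is a toy implementation built on salted `hashlib.blake2b` hashes rather than any particular library, and `containment` is included only to show what I mean by "contains all of the attributes of the target item".

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of a's attributes that also appear in b: |A & B| / |A|."""
    return len(a & b) / len(a)


def minhash_signature(items, num_hashes=256):
    """Toy MinHash: for each salted hash function, keep the minimum hash value."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(x.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for x in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical data: the target's 50 attributes are a strict subset
# of the candidate's 200 attributes.
target = {f"attr_{i}" for i in range(50)}
candidate = {f"attr_{i}" for i in range(200)}

exact = jaccard(target, candidate)         # 50 / 200 = 0.25
contained = containment(target, candidate) # 50 / 50 = 1.0
estimate = estimate_jaccard(
    minhash_signature(target), minhash_signature(candidate)
)

print(f"exact Jaccard:    {exact:.2f}")
print(f"containment:      {contained:.2f}")
print(f"MinHash estimate: {estimate:.2f}")
```

The exact Jaccard similarity is 0.25 even though every attribute of the target is present in the candidate, and the MinHash estimate lands near that same value, which is why a typical LSH similarity threshold would never surface this pair.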

What are the best techniques or algorithms to use to compare items with dissimilar numbers of attributes?

Anthony Gatlin

0 Answers