
Let's say I have my list of ingredients: {'potato','rice','carrot','corn'}

and I want to return lists from a database that are most similar to mine:

{'beans','potato','oranges','lettuce'}, {'carrot','rice','corn','apple'}, {'onion','garlic','radish','eggs'}

My query would return this first: {'carrot','rice','corn','apple'}
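To make the desired ranking concrete, here is a minimal sketch of the brute-force version of what I mean, assuming "most similar" simply means "shares the most items with my list" (this is exactly the loop I can't afford at scale):

```python
# Naive illustration: rank stored lists by how many ingredients they share
# with the query list. This is the linear scan I want to avoid over a million lists.
query = {'potato', 'rice', 'carrot', 'corn'}

database = [
    {'beans', 'potato', 'oranges', 'lettuce'},
    {'carrot', 'rice', 'corn', 'apple'},
    {'onion', 'garlic', 'radish', 'eggs'},
]

# Sort by overlap size, largest first.
ranked = sorted(database, key=lambda items: len(items & query), reverse=True)
print(ranked[0])  # {'carrot', 'rice', 'corn', 'apple'}
```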

I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.

In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.

What technology should I use to accomplish what I want to do?

Should I look away from search indexers and more towards database-style tools like Mongo, MapReduce, Hadoop...? All I know are the names of these technologies; I just need someone to point me in the right direction on what technology path I should be exploring for this.

With so much data I can't really loop through it; I need to query everything at once.

JaseC

1 Answer


I wonder what keeps you from trying this with Solr, as Solr provides much of what you need. You can declare the field as `type="string" multiValued="true"` and save each list item as a value. Then, when querying, you specify each of the items in the list you're looking for as a search term for that field, and Solr will – by default – return the closest matches first. If you need exact control over what counts as a match (e.g. at least 40% of the terms from the search list have to be present in a matching list), you can use the `mm` (minimum-should-match) EDisMax parameter; see the Solr Wiki.
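As a rough sketch of what the query side could look like – assuming a core named `recipes` and a multivalued `ingredients` field (both names are placeholders, adjust to your schema) and using Solr's plain HTTP select endpoint:

```python
import requests

# Hypothetical core and field names; the field would be declared in schema.xml
# roughly as:
#   <field name="ingredients" type="string" indexed="true" stored="true" multiValued="true"/>
SOLR = "http://localhost:8983/solr/recipes/select"

params = {
    "q": "potato rice carrot corn",   # one term per item in the search list
    "defType": "edismax",             # enables the mm (minimum-should-match) parameter
    "qf": "ingredients",              # search the multivalued ingredients field
    "mm": "40%",                      # e.g. at least 40% of the terms must match
    "fl": "id,ingredients,score",
    "rows": 10,
}

resp = requests.get(SOLR, params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["score"], doc.get("ingredients"))
```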

Having said that, I must add that I've never searched with 200 query terms (do I understand correctly that the list whose contents should be searched for will contain about 200 items?) and do not know how well that performs. But I would guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
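If it helps, a throwaway loader along these lines (again assuming a `recipes` core with a multivalued `ingredients` field, and using Solr's JSON update endpoint) should be enough to generate test data:

```python
import random
import requests

# Hypothetical vocabulary and core name, purely for generating test data.
VOCAB = ["potato", "rice", "carrot", "corn", "beans", "oranges", "lettuce",
         "onion", "garlic", "radish", "eggs", "apple"]
UPDATE = "http://localhost:8983/solr/recipes/update"

docs = [
    {"id": str(i), "ingredients": random.sample(VOCAB, k=4)}
    for i in range(100_000)
]

# Post in one batch and commit; for a million docs you would want to chunk this.
resp = requests.post(UPDATE, json=docs, params={"commit": "true"})
resp.raise_for_status()
```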

BlueM
  • This is good information. Thanks so much for pointing me in the right direction with solr. I'm going to give this a try. – JaseC Jun 13 '15 at 01:41
  • OK I tried this and it returned zero results =\ Not sure where to go from here. Trying to look for a freelancer. – JaseC Jun 18 '15 at 01:09
  • The approach generally works. (I am very sure about it, because I had tried it before I posted the answer.) Why it didn’t work in your case: no idea. – BlueM Jun 18 '15 at 04:19
  • Hmm.. well I seem to be hitting a brick wall. I've escalated to the world of paid support: https://www.elance.com/j/build-schemas-two-solr-indexes-making-them-perform-as-specified/74492979/ – JaseC Jun 18 '15 at 06:55
  • Ended up using type="text_general" omitNorms="true" multiValued="true" in schema.xml. Had to include defType=edismax in the query and had to designate the field in df and pf. Also since I'm using open solr I had to delete the following from their default schema: `text ` The final key was to populate the field as an array not string so value was ["one","two"] etc. Working great now including the mm modifier. Thanks! – JaseC Jun 24 '15 at 04:46
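A sketch of the query that the setup described in the last comment corresponds to – again with placeholder core and field names, and with the field declared as `type="text_general" omitNorms="true" multiValued="true"` and populated as an array (e.g. `["one","two"]`):

```python
import requests

# Placeholder core/field names; documents were indexed with the ingredients
# field as an array of values rather than a single string.
SOLR = "http://localhost:8983/solr/recipes/select"

params = {
    "q": "potato rice carrot corn",
    "defType": "edismax",        # required for the mm parameter to apply
    "df": "ingredients",         # default field to search
    "pf": "ingredients",         # phrase-boost field
    "mm": "40%",                 # minimum-should-match threshold
}

docs = requests.get(SOLR, params=params).json()["response"]["docs"]
```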