
I'm aiming to provide a one-search-box-for-everything model in a search engine project, like LinkedIn's.

I've tried to express my problem using an analogy.

Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.

Some sample queries:

  • "information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}

  • "ACM paper by authorABC on design patterns": three dimensions {conf-name, author, topic}

  • "Multi-threaded programming at javaranch" : two dimensions {topic, website}

I have to identify those dimensions and their corresponding keywords in a large query before I can retrieve the final result from the database.

Points

  • I have access to all the possible values of each dimension. For example, I have all the conference names, author names, etc.
  • There's very little overlap of terms across dimensions.

My approach (naive)

  • Using Lucene, index all the keywords in each dimension, with a dedicated field called "dimension" and another field holding the actual value. Ex:

    1) {name:IEEE, dimension:conference}, etc.

    2) {name:ooad, dimension:topic}, etc.

    3) {name:xyz, dimension:author}, etc.

  • Search the index with the query as-is.
  • Iterate through the results up to some cutoff and recognize the first document that matches each new dimension.
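The keyword-to-dimension lookup described above can be simulated without Lucene to see the idea; a minimal sketch in plain Java, assuming exact single-token matches against the known vocabularies (all names are illustrative):

```java
import java.util.*;

public class DimensionIndex {
    // keyword -> dimension, built from the known vocabularies
    private final Map<String, String> index = new HashMap<>();

    public void add(String keyword, String dimension) {
        index.put(keyword.toLowerCase(), dimension);
    }

    // Tag each query token with its dimension, if known; unknown tokens are skipped.
    public Map<String, String> recognize(String query) {
        Map<String, String> hits = new LinkedHashMap<>();
        for (String token : query.toLowerCase().split("\\W+")) {
            String dim = index.get(token);
            if (dim != null) hits.put(token, dim);
        }
        return hits;
    }

    public static void main(String[] args) {
        DimensionIndex idx = new DimensionIndex();
        idx.add("IEEE", "conference");
        idx.add("ooad", "topic");
        idx.add("xyz", "author");
        System.out.println(idx.recognize("ooad papers at IEEE by xyz"));
        // {ooad=topic, ieee=conference, xyz=author}
    }
}
```

In the Lucene version this corresponds to indexing each keyword as its own document with a name field and a dimension field, then searching on the name field.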

Problems

  • Not sure when to stop recognizing dimensions from the result set. For example, the query may contain only two dimensions, but the results may match three.
  • If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.

References to papers or articles, or pointers to the right terminology for my problem domain, would certainly help.

Any guidance is highly appreciated.

phanin

2 Answers


Solution 1: How about solving your problem with Named Entity Recognition (NER), from natural language processing? NER can be done with simple regular expressions (when the data is fairly static), or with a machine learning technique such as Hidden Markov Models (HMMs) to recognize the named entities in your sequence data. I stress HMMs over other supervised machine learning algorithms because you have sequential data, with each state dependent on the previous or next state. NER would output the dimensions along with the corresponding names. After that, your search becomes a vertical search problem: you can search for the identified words in different Solr/Lucene fields and set your boosts accordingly.
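Since entity names like "design patterns" span multiple tokens, a dictionary (gazetteer) longest-match tagger is a useful baseline before investing in an HMM. A sketch, with all phrases and entity types illustrative:

```java
import java.util.*;

public class GazetteerTagger {
    // known phrase -> entity type (the "dimension")
    private final Map<String, String> gazetteer = new HashMap<>();

    public void add(String phrase, String type) {
        gazetteer.put(phrase.toLowerCase(), type);
    }

    // Greedy longest-match tagging: at each position, prefer the longest known phrase.
    public List<String[]> tag(String query) {
        String[] tokens = query.toLowerCase().split("\\s+");
        List<String[]> entities = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matched = 0;
            String type = null;
            for (int j = tokens.length; j > i; j--) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, j));
                type = gazetteer.get(phrase);
                if (type != null) { matched = j - i; break; }
            }
            if (type != null) {
                String phrase = String.join(" ", Arrays.copyOfRange(tokens, i, i + matched));
                entities.add(new String[]{phrase, type});
                i += matched;
            } else {
                i++;  // token not in any vocabulary, skip it
            }
        }
        return entities;
    }
}
```

An HMM replaces this greedy lookup with learned transition probabilities between entity states, which helps when a term is ambiguous between dimensions.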

Coming to the implementation part: I assume you know Java, as you are working with Lucene, so Mahout is a good choice. Mahout has an HMM implementation built in, and you can train and test the model on your data set. I am also assuming you have a large data set.

Solution 2: Try modelling this as a property graph problem, and check out something like Neo4j. I suggest this because your problem falls in a schema-less domain: your schema is not fixed, and the problem can very well be modelled as a graph where each node is a set of key-value pairs.
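The property-graph shape can be prototyped in plain Java before committing to Neo4j. This is not the Neo4j API, and the data is illustrative; it only shows nodes as key-value maps linked by edges:

```java
import java.util.*;

public class PropertyGraph {
    static class Node {
        final Map<String, String> props = new HashMap<>();  // free-form key-value pairs
        final List<Node> neighbors = new ArrayList<>();     // outgoing edges
    }

    public static void main(String[] args) {
        // An article node linked to its author node; each node carries
        // whatever properties it needs, with no fixed schema.
        Node article = new Node();
        article.props.put("title", "some-article-title");
        Node author = new Node();
        author.props.put("name", "authorABC");
        article.neighbors.add(author);
        System.out.println(article.props.get("title")
                + " -> " + author.props.get("name"));
    }
}
```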

Solution 3: Since you said you have all possible values of the dimensions, why not, before anything else, simply convert your unstructured text into structured data using regular expressions? And since you do not have a fixed schema, store the result in a NoSQL key-value database; most of them provide Lucene integration for full-text search, so you can then simply search those databases.
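The regex-based conversion can be sketched as follows, building one alternation per dimension from its known values; the vocabulary contents here are illustrative:

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.Collectors;

public class RegexExtractor {
    // vocab: dimension -> list of its known values.
    // Returns a structured record: dimension -> values found in the text.
    public static Map<String, List<String>> extract(String text,
                                                    Map<String, List<String>> vocab) {
        Map<String, List<String>> record = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : vocab.entrySet()) {
            if (e.getValue().isEmpty()) continue;
            // One alternation per dimension, with each value regex-escaped
            String alternation = e.getValue().stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            Matcher m = Pattern.compile("\\b(" + alternation + ")\\b",
                    Pattern.CASE_INSENSITIVE).matcher(text);
            List<String> found = new ArrayList<>();
            while (m.find()) found.add(m.group(1));
            if (!found.isEmpty()) record.put(e.getKey(), found);
        }
        return record;
    }
}
```

The resulting map is exactly the kind of schema-free record a NoSQL key-value store can hold.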

Yavar

What you need to do is calculate the similarity between the query and the document set you are searching in. Measures like cosine similarity should serve your need. A hack you can use is to calculate TF-IDF for each document and build an index using that score, from which you can choose the appropriate one. I would recommend you look into the Vector Space Model to find a method that serves your need. Give this algorithm a look as well: http://en.wikipedia.org/wiki/Okapi_BM25
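Cosine similarity over raw term-frequency vectors can be sketched as follows; a real system would weight terms by TF-IDF (or use BM25 as linked above), but the shape is the same:

```java
import java.util.*;

public class CosineSimilarity {
    // Term-frequency vector for a piece of text
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+"))
            tf.merge(t, 1, Integer::sum);
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|), over shared terms
    public static double cosine(String a, String b) {
        Map<String, Integer> ta = termFreq(a), tb = termFreq(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            dot += e.getValue() * tb.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : tb.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Ranking then means computing this score between the query and each candidate document and sorting descending.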

Moiz
  • Vector Space Model, TF-IDF, BM25, etc. are the fundamental building blocks of any search engine. The OP is using Lucene and is hence already using all the stuff you have mentioned. – Yavar Oct 10 '13 at 18:21