I'm aiming at providing one-search-box-for-everything model in search engine project, like LinkedIn.
I've tried to express my problem using an analogy.
Let's assume that each result is an article and has multiple dimensions like author, topic, conference (if that's a publication), hosted website, etc.
Some sample queries:
"information retrieval papers at IEEE by authorXYZ": three dimensions {topic, conf-name, authorname}
"ACM paper by authoABC on design patterns" : three dimensions {conf-name, author, topic}
"Multi-threaded programming at javaranch" : two dimensions {topic, website}
I've to identify those dimensions and corresponding keywords in a big query before I can retrieve the final result from the database.
Points
- I've access to all the possible values to all the dimensions. For example, I've all the conference names, author names, etc.
- There's very little overlap of terms across dimensions.
My approach (naive)
Using Lucene, index all the keywords in each dimension with a dedicated field called "dimension" and another field with actual value. Ex:
1) {name:IEEE, dimension:conference}, etc.
2) {name:ooad, dimension:topic}, etc.
3) {name:xyz, dimension:author}, etc.
- Search the index with the query as-it-is.
- Iterate through results up to some extent and recognize first document with a new dimension.
Problems
- Not sure when to stop recognizing the dimensions from the result set. For example, the query may contain only two dimensions but the results may match 3 dimensions.
- If I want to include spell-checking as well, it becomes more complex and the results tend to be less accurate.
References to papers, articles, or pointing-out the right terminology that describes my problem domain, etc. would certainly help.
Any guidance is highly appreciated.