Identifying the most relevant document using Lucene

Question

I am trying to solve the following search problem. Say we have 10 different documents, d1..d10. Each document contains one type of data: say, d1 -> list of movie names, d2 -> list of actor names, d3 -> list of addresses, etc. Each document contains a list of entities and their scores, so d1 contains movie names and their popularity, etc. Assume the scores are all normalized (0 to max_score across the documents).

Now, given a search query (phrase), I want to score the 10 documents based on how relevant each one is to the search phrase.

My question is whether using Lucene is a good way to approach this. I plan to index each phrase with its score as a separate document inside Lucene and then query for the top match.

I don't want to search for the individual entities. I am okay with getting the overall score of an entity type for a given search phrase. For example, if someone searches for "lord of the rings", I need to be able to say that it is most likely a movie and not an actor or an address. My goal is to minimize space consumption and optimize performance.
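To make the goal concrete, here is a rough sketch of the per-type scoring I have in mind. It is plain Python standing in for Lucene, not Lucene code. The sample data, the function names (build_index, rank_types) and the scoring rule (token overlap between the query and a phrase, multiplied by the phrase's normalized score, keeping the best phrase per type) are all assumptions made up for illustration; Lucene's own relevance scoring works differently.

```python
from collections import defaultdict

# Hypothetical sample data: entity type -> list of (phrase, normalized score).
documents = {
    "movies": [("The Lord of the Rings", 0.9), ("Lord of War", 0.5)],
    "actors": [("Jack Lord", 0.4)],
    "addresses": [("12 Ring Road", 0.3)],
}

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def build_index(docs):
    """Map each token to the (type, phrase, score, token count) entries containing it."""
    index = defaultdict(list)
    for doc_type, entries in docs.items():
        for phrase, score in entries:
            tokens = set(tokenize(phrase))
            for token in tokens:
                index[token].append((doc_type, phrase, score, len(tokens)))
    return index

def rank_types(index, query):
    """Return (type, score) pairs, best first, for types with at least one match."""
    query_tokens = set(tokenize(query))
    hits = defaultdict(int)   # (type, phrase) -> number of shared tokens
    meta = {}                 # (type, phrase) -> (score, token count)
    for token in query_tokens:
        for doc_type, phrase, score, n_tokens in index.get(token, []):
            hits[(doc_type, phrase)] += 1
            meta[(doc_type, phrase)] = (score, n_tokens)
    best = defaultdict(float)
    for (doc_type, phrase), shared in hits.items():
        score, n_tokens = meta[(doc_type, phrase)]
        # Assumed rule: Jaccard overlap of token sets, weighted by the phrase score.
        overlap = shared / (len(query_tokens) + n_tokens - shared)
        best[doc_type] = max(best[doc_type], overlap * score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

index = build_index(documents)
ranking = rank_types(index, "lord of the rings")
print(ranking[0][0])  # movies
```

Only the type label of the top result is returned to the caller; the individual phrases are used for matching but never surfaced, which is the behaviour I am after.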

Should work fine, I don't see anything problematic there. You can search by your text or title field, and sort on your score field. — femtoRgon, Apr 30 '14 at 02:57
Thanks. Another question I have is whether I should index each phrase (inside di) as a separate Lucene document, or whether there is a way to index the entire document di as one Lucene document, taking into account the scores of each individual phrase within that document. Basically I want to consume less space and optimize performance, as this is a lot of data. In the end I only need to return the document id (1..10). — Kamal, Apr 30 '14 at 03:07
How you index things should be informed by the units you want to search for. Sounds like you want to search for entities within your larger documents, that is a single movie title, etc., so those movie titles (and other attendant information) should be your documents. — femtoRgon, Apr 30 '14 at 03:12
I don't want to search for the individual entities. I am okay with getting the overall score of an entity type for a given search phrase. For example, if someone searches for "lord of the rings", I need to be able to say that it is most likely a movie and not an actor or an address. — Kamal, Apr 30 '14 at 03:19
