
I want to implement a textual search with Lucene over some documents. The documents are provided already tokenized in a table:
|documentID|token|position|
'documentID' is the ID of the document that contains the token.
'position' is the position of the token within that document.

My first attempt was to build an index that takes a token and returns the matching documentIDs. I created one Lucene document per documentID and added to it one IntField 'documentID' plus one StringField 'token' for every token in that document. So far, no problem.

For my second attempt I want to include the position information. My first thought was: 'No problem, just add StringFields 'position' to the Lucene document!' But then there is no relation between the positions and the tokens. Here is an example of what I want in the end:
INPUT:
tomato
OUTPUT:
docID1|position1
docID1|position2
docID2|position1
...

How can I achieve this? In my opinion, the simplest solution is to stop mapping each document to one Lucene document and instead map tokens to Lucene documents.
I would create one Lucene document for every unique token/documentID combination (only for documentIDs that actually contain the token). Each of these gets the token and the documentID as fields, plus one IntField 'position' for every occurrence of the token in that document.
Example:

StringField 'token1'
IntField 'documentID1'
IntField 'position1'
IntField 'position2'
IntField 'position3'

StringField 'token1'
IntField 'documentID2'
IntField 'position1'
IntField 'position2'

Are there other ways to store linked fields?

EarlGrey

1 Answer


Why do you want to use Lucene for this? Do you have further requirements that need Lucene features? If not, a "Map" data structure is enough: the key is the token, and the value holds the list of document IDs, each with its list of positions.
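To make the map idea concrete, here is a minimal sketch in plain Java, not a definitive implementation. The class name PositionalIndex and its methods add and search are made up for illustration, and the sample rows are invented. It keeps everything in memory, so it does not address persistence.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory positional index: token -> (documentID -> positions). */
public class PositionalIndex {

    // The inner TreeMap keeps documentIDs sorted, so results come out in a stable order.
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();

    /** Registers one row of the (documentID, token, position) table. */
    public void add(int documentId, String token, int position) {
        postings.computeIfAbsent(token, t -> new TreeMap<>())
                .computeIfAbsent(documentId, d -> new ArrayList<>())
                .add(position);
    }

    /**
     * Returns one "documentID|position" line per occurrence of the token.
     * Positions appear in the order in which they were added.
     */
    public List<String> search(String token) {
        Map<Integer, List<Integer>> byDocument = postings.get(token);
        if (byDocument == null) {
            return Collections.emptyList();
        }
        List<String> hits = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> entry : byDocument.entrySet()) {
            for (int position : entry.getValue()) {
                hits.add(entry.getKey() + "|" + position);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample rows, only to show the shape of the output.
        PositionalIndex index = new PositionalIndex();
        index.add(1, "tomato", 4);
        index.add(1, "tomato", 9);
        index.add(2, "tomato", 1);
        index.add(2, "potato", 2);
        index.search("tomato").forEach(System.out::println);
    }
}
```

With these sample rows, searching for "tomato" gives the lines 1|4, 1|9 and 2|1, which is the docID|position output format asked for in the question.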

fatih
  • I want to implement a search engine. In the future I want to do some information extraction on this data, so I will add more tables with the occurrences of names, places, events, ... Also, a Lucene index is persistent and optimized for holding thousands of entries. A map only exists at runtime, but I need something persistent and efficient. In the end I want to do what you suggested, but in Lucene. – EarlGrey Feb 23 '14 at 22:24