I want to implement a textual search with Lucene over some documents. The documents are provided already tokenized in a table:
|documentID|token|position|
'documentID' is the ID of the document that contains the token.
'position' is the position of the token within that document.
My first attempt was to build an index that lets me search for a token and get back the matching documentIDs. I created one Lucene document per documentID. To each Lucene document I added one IntField 'documentID' and one StringField 'token' for every token in that document. So far, no problem.
Then I started my second attempt, which should also include the position information. My first thought was: 'No problem, just add StringFields 'position' to the Lucene document!' But then there is no relation between the positions and the tokens. Here is an example of what I want in the end:
INPUT:
tomato
OUTPUT:
docID1|position1
docID1|position2
docID2|position1
...
How can I achieve this? In my opinion, the simplest solution would be to stop mapping each document to one Lucene document and instead map the individual tokens to Lucene documents.
So I create one Lucene document for every unique token/documentID combination (only for documents that actually contain the token, of course). To each of these I add the token and the documentID as fields, plus one IntField 'position' for every occurrence of the token in that document.
Example:
StringField 'token1'
IntField 'documentID1'
IntField 'position1'
IntField 'position2'
IntField 'position3'
StringField 'token1'
IntField 'documentID2'
IntField 'position1'
IntField 'position2'
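To make the grouping behind this layout concrete, here is a minimal sketch in plain Java. It makes no Lucene calls, and the names TokenGrouping, Row, group and lookup are made up for illustration. It only shows how the table rows collapse into one entry per token/documentID combination with a list of positions, and how a lookup for a token would produce the docID|position lines from the example further up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokenGrouping {

    /** One row of the input table: documentID | token | position. */
    public static final class Row {
        final int documentId;
        final String token;
        final int position;

        public Row(int documentId, String token, int position) {
            this.documentId = documentId;
            this.token = token;
            this.position = position;
        }
    }

    /**
     * Groups the rows as token -> documentID -> positions.
     * Each (token, documentID) pair corresponds to one Lucene document
     * in the layout described above; positions keep their input order.
     */
    public static Map<String, Map<Integer, List<Integer>>> group(List<Row> rows) {
        Map<String, Map<Integer, List<Integer>>> grouped = new TreeMap<>();
        for (Row r : rows) {
            grouped.computeIfAbsent(r.token, t -> new TreeMap<>())
                   .computeIfAbsent(r.documentId, d -> new ArrayList<>())
                   .add(r.position);
        }
        return grouped;
    }

    /** Returns the hits for a token as "docID|position" strings, ordered by docID. */
    public static List<String> lookup(Map<String, Map<Integer, List<Integer>>> grouped,
                                      String token) {
        List<String> hits = new ArrayList<>();
        Map<Integer, List<Integer>> perDocument = grouped.get(token);
        if (perDocument == null) {
            return hits; // token does not occur in any document
        }
        for (Map.Entry<Integer, List<Integer>> entry : perDocument.entrySet()) {
            for (int position : entry.getValue()) {
                hits.add(entry.getKey() + "|" + position);
            }
        }
        return hits;
    }
}
```

With rows (1, tomato, 3), (1, tomato, 7) and (2, tomato, 1), a lookup for 'tomato' yields one line per occurrence, which is the output format I am after.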
Are there other ways to store linked fields?