I am very new to the index structure of Lucene so please tell me if this makes sense or if I am trying to use a hammer to drill a hole.
Main Point / Overview
I believe I need to overwrite Lucene's term frequency with a number of my own, i.e. a value in [0,100] that represents a probability (in [0,1]), or another number that serves as a measure of evidence and can take the place of term frequency. Is it possible to overwrite the term frequency value at index time so that my number is actually stored inside the Lucene index (instead of the term frequency Lucene normally computes)?
In Detail:
I have files that contain little or no text. Instead, they are mostly (or are treated as) digital artifacts with meta information. This meta information consists of learned conceptual probabilities obtained from classifiers and other machine learning methods (e.g. based on object recognition, color histograms, or a combination of evidence). Here is a very simple example in which an image was classified (with high probability) as containing a tree and also depicting a house:
filepath: /pics/1.jpg
meta: tree = 0.9
meta: house = 0.8
meta: dog = 0.0
... (up to 10000 meta fields)
Another one shows a dog and a house, but no tree:
filepath: /pics/2.jpg
meta: tree = 0.0
meta: house = 0.3
meta: dog = 1.0
... (up to 10000 meta fields)
Each meta tag is stored in its own document field named 'meta', so that all of them can be searched by directing the query at that field. Each field contains the concept as a word or phrase and is treated as one token.
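To make the intended storage concrete, here is a minimal plain-Java sketch of what I have in mind. It is not Lucene code: `MetaPostings`, `toPseudoTf`, `add` and `pseudoTf` are hypothetical names, and the scaling to [0,100] as well as the decision to skip concepts with probability 0.0 are assumptions on my part.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the 'meta' field; not a Lucene class. */
class MetaPostings {
    // concept -> (document path -> pseudo term frequency in [0,100])
    private final Map<String, Map<String, Integer>> postings = new TreeMap<>();

    /** Maps a probability in [0,1] to an integer pseudo term frequency in [0,100]. */
    static int toPseudoTf(double probability) {
        // The negated form also rejects NaN.
        if (!(probability >= 0.0 && probability <= 1.0)) {
            throw new IllegalArgumentException("probability must be in [0,1]: " + probability);
        }
        return (int) Math.round(probability * 100.0);
    }

    /** Records one concept for one document; zero-weight concepts are skipped (assumption). */
    void add(String path, String concept, double probability) {
        int tf = toPseudoTf(probability);
        if (tf == 0) {
            return;
        }
        postings.computeIfAbsent(concept, k -> new LinkedHashMap<>()).put(path, tf);
    }

    /** Returns the stored pseudo term frequency, or 0 if nothing was recorded. */
    int pseudoTf(String concept, String path) {
        Map<String, Integer> docs = postings.get(concept);
        if (docs == null) {
            return 0;
        }
        Integer tf = docs.get(path);
        return tf == null ? 0 : tf;
    }
}
```

With the example above, `add("/pics/1.jpg", "tree", 0.9)` would store 90 for the concept "tree", while `add("/pics/1.jpg", "dog", 0.0)` would store nothing at all.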
So I primarily have external sources of evidence for what pictures 1 and 2 are about, and I am aware that this is mostly outside the realm of the classic TF-IDF paradigm. I would like to insert these probabilities (for the 'meta' field) into Lucene's scoring scheme, so that I can search for these meta information tokens and have their probabilities contribute to the score just as term frequency does in TF-IDF. If I search for meta:tree AND meta:dog, I want to find the second document, and this can be achieved if the scoring uses these new probability-based TFs. So, if I can replace the TF of each of these meta concepts (tree, house and dog) with its probability, then I can bring this into Lucene without changing everything else.
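As a rough illustration of the ranking I am after, here is a plain-Java sketch. It is not Lucene's scoring formula: it ignores IDF and length normalization, and it simply sums the stored weights of the query concepts, so a weight of 0 for one concept does not exclude a document. `ProbabilityScoring`, `score`, `rank`, `weights` and `exampleDocs` are hypothetical names, and the weights are the example probabilities scaled by 100 (an assumption).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical scoring sketch; not part of Lucene. */
class ProbabilityScoring {
    /** Sums the stored weights of the query concepts for one document. */
    static int score(Map<String, Integer> docWeights, List<String> queryConcepts) {
        int total = 0;
        for (String concept : queryConcepts) {
            total += docWeights.getOrDefault(concept, 0);
        }
        return total;
    }

    /** Returns document paths ordered by descending score (ties keep insertion order). */
    static List<String> rank(Map<String, Map<String, Integer>> docs, List<String> queryConcepts) {
        List<String> paths = new ArrayList<>(docs.keySet());
        paths.sort(Comparator.comparingInt(
                (String p) -> score(docs.get(p), queryConcepts)).reversed());
        return paths;
    }

    /** Builds the weight map for one document from its three example concepts. */
    static Map<String, Integer> weights(int tree, int house, int dog) {
        Map<String, Integer> w = new HashMap<>();
        w.put("tree", tree);
        w.put("house", house);
        w.put("dog", dog);
        return w;
    }

    /** The two example pictures, with probabilities scaled to [0,100]. */
    static Map<String, Map<String, Integer>> exampleDocs() {
        Map<String, Map<String, Integer>> docs = new LinkedHashMap<>();
        docs.put("/pics/1.jpg", weights(90, 80, 0));
        docs.put("/pics/2.jpg", weights(0, 30, 100));
        return docs;
    }
}
```

For the concepts tree and dog, this gives /pics/1.jpg a score of 90 and /pics/2.jpg a score of 100, so the second document ranks first. Whether a strict AND should still match a document whose weight for one of the concepts is 0 depends on whether zero-weight concepts are indexed at all.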
Does this make sense? Does Lucene provide such a low level modification on the index? Am I heading in the right direction?