Lucene: Overwrite Term Frequency at Index Time

Question

I am very new to the index structure of Lucene so please tell me if this makes sense or if I am trying to use a hammer to drill a hole.

Main Point / Overview

I believe I need to overwrite Lucene's term frequency with a number of my own: a value in [0,100] that represents a probability in [0,1], or another number that serves as a measure of evidence and can take the place of term frequency. Is it possible to overwrite the term frequency value at index time so that my number is actually stored inside the Lucene index instead of the term frequency Lucene would normally compute?

In Detail:

I have files that contain no text or very little text. Instead, they are mostly (or are treated as) digital artifacts with meta information. This meta information consists of learned conceptual probabilities obtained from classifiers and other machine learning methods (e.g. based on object recognition, color histograms, or a combination of evidence). Here is a very simple example in which an image was classified (with high probability) as containing a tree and also depicting a house.

filepath: /pics/1.jpg
meta: tree = 0.9
meta: house = 0.8
meta: dog = 0.0
... (up to 10000 meta fields)

Another one shows a dog and a house, but no tree.

filepath: /pics/2.jpg
meta: tree = 0.0
meta: house = 0.3
meta: dog = 1.0
... (up to 10000 meta fields)

Each meta tag is stored in a separate document field called 'meta' to make all of them searchable by directing the search to it. Each field contains the concept as a word or phrase and is treated as one token.

So I have primarily external sources of evidence for what pictures 1 and 2 are about, and I am aware that this is mostly outside the realm of the classic TF-IDF paradigm. I would like to insert these probabilities (for the 'meta' field) into Lucene's scoring scheme, so that a search for these meta tokens brings their probabilities into the score just as TF does in TF-IDF. If I search for meta:tree AND meta:dog, I want to find the second document, and this can be achieved if the scoring uses these new probability TFs. So, if I can replace the TF of each of these meta concepts (tree, house and dog) with its probability, I can bring this into Lucene without changing everything else.

Does this make sense? Does Lucene provide such a low level modification on the index? Am I heading in the right direction?

There are a number of options. Can you give more detail about what you're trying to do? This may be relevant http://stackoverflow.com/questions/8880396/boosting-lucene-terms-when-building-the-index — bcoughlan, Oct 28 '14 at 12:00
I elaborated the text to make it more clear. I hope it helps. — RalfB, Oct 28 '14 at 12:55

score 0 · Accepted Answer · answered Oct 28 '14 at 14:47

How about subclassing DefaultSimilarity and overriding the tf method?

Have you read the information about scoring in the Lucene doco?
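
A minimal sketch of what that override would change, written in plain Java so it runs without Lucene on the classpath. `defaultTf` mirrors the square-root formula that DefaultSimilarity.tf uses; `probabilityTf` is a hypothetical replacement that assumes the stored frequency is a probability scaled to [1,100]. In real code the second method body would go into a `tf(float freq)` override on a DefaultSimilarity subclass. Either way, the method only sees the frequency that is already in the index, so the custom value still has to get there at index time.

```java
final class TfComparison {

    // Same formula as DefaultSimilarity.tf(float freq): square root of the frequency.
    static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical replacement: assumes the indexed frequency is a
    // probability scaled to [1,100] and maps it back to (0,1].
    static float probabilityTf(float freq) {
        return freq / 100f;
    }

    public static void main(String[] args) {
        float[] storedFrequencies = {30f, 80f, 90f, 100f};
        for (float freq : storedFrequencies) {
            System.out.println("freq=" + freq
                    + " defaultTf=" + defaultTf(freq)
                    + " probabilityTf=" + probabilityTf(freq));
        }
    }
}
```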

Martin Wilson

I believe DefaultSimilarity (and all its descendants) only works on stats that are already in the index (e.g. TF). I do not want to look up information outside the index for every document each time I run a query... – RalfB Oct 28 '14 at 15:43
I think it can be used to influence the score stored in the Lucene document: "Changing Similarity is an easy way to influence scoring, this is done at index-time with IndexWriterConfig.setSimilarity(Similarity)" (from http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/package-summary.html#scoring) – Martin Wilson Oct 28 '14 at 15:53

score 0 · Answer 2 · answered Nov 02 '20 at 20:43

This question was used as supporting evidence for LUCENE-7854 and the ability to provide your own term frequencies was added to Lucene 7.0.

To use it, use the DelimitedTermFrequencyTokenFilter in your analyzer.
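
The filter reads tokens of the form `term|frequency` (with `|` as the default delimiter), and the frequency has to be a positive integer, so a concept with probability 0.0 is best left out of the field altogether. The field also needs to be indexed with frequencies but without positions or offsets. Here is a small sketch of preparing the field text from the probabilities in the question. It is plain Java with no Lucene dependency; the class and method names are made up for illustration, and it assumes single-word concepts, a whitespace tokenizer in front of the filter, and the [0,100] scaling suggested in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class MetaTermFrequencies {

    // Builds text such as "tree|90 house|80" for the 'meta' field.
    // Probabilities are scaled to integers in [1,100]; anything that
    // rounds to 0 is skipped because a term frequency must be at least 1.
    static String toDelimitedText(Map<String, Double> probabilities) {
        StringJoiner joined = new StringJoiner(" ");
        for (Map.Entry<String, Double> entry : probabilities.entrySet()) {
            int freq = (int) Math.round(entry.getValue() * 100.0);
            if (freq >= 1) {
                joined.add(entry.getKey() + "|" + freq);
            }
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Values for /pics/1.jpg from the question.
        Map<String, Double> pic1 = new LinkedHashMap<>();
        pic1.put("tree", 0.9);
        pic1.put("house", 0.8);
        pic1.put("dog", 0.0);
        System.out.println(toDelimitedText(pic1)); // prints "tree|90 house|80"
    }
}
```

The resulting string would be the value of the 'meta' field, analyzed by a chain that ends in DelimitedTermFrequencyTokenFilter.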

Luke Francl
