
I am trying to understand how the Lucene function getTermFreqVector() works, in order to compute the cosine similarity between two documents. Can anyone shed some light on what "field-name" means in getTermFreqVector(doc number, field-name)?

1 Answer


An inverted index like Lucene stores data in a way that makes searching by term very fast. You index documents, which are collections of fields. A field is just a key-value pair: a field name and a field value.

You can easily find which documents contain a specific term, but retrieving all the indexed terms of a specific document is harder, because the terms are organized per field, not per document. Term vectors overcome this problem by storing that information per document, so that you can retrieve it efficiently, at the price of a bigger index.
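To make the difference concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's actual storage format or API: the class `ToyIndex` and its methods `add`, `docsContaining` and `termFreqVector` are names made up for illustration, and terms are split on whitespace only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model (hypothetical, not Lucene code) of the two lookups described above.
final class ToyIndex {
    // Inverted index: field name -> term -> ids of the documents containing it.
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();
    // Term vectors: document id -> field name -> term -> frequency.
    private final Map<Integer, Map<String, Map<String, Integer>>> termVectors = new HashMap<>();

    // Indexes one field of one document, updating both structures.
    void add(int docId, String field, String value) {
        for (String term : value.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(field, f -> new HashMap<>())
                    .computeIfAbsent(term, t -> new TreeSet<>())
                    .add(docId);
            termVectors.computeIfAbsent(docId, d -> new HashMap<>())
                       .computeIfAbsent(field, f -> new TreeMap<>())
                       .merge(term, 1, Integer::sum);
        }
    }

    // The cheap direction for an inverted index: term -> documents.
    Set<Integer> docsContaining(String field, String term) {
        return postings.getOrDefault(field, Collections.emptyMap())
                       .getOrDefault(term, Collections.emptySet());
    }

    // The direction term vectors make cheap: (document, field) -> terms and counts.
    Map<String, Integer> termFreqVector(int docId, String field) {
        return termVectors.getOrDefault(docId, Collections.emptyMap())
                          .getOrDefault(field, Collections.emptyMap());
    }
}
```

Without the second map, answering "which terms does document 0 have in its body field?" would mean scanning every term of that field and checking its document list. The second map is the extra data that makes the index bigger.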

Back to your question: term vectors are stored per document and per field, which is why you have to provide both the document id and the field name to retrieve one. The field name is simply the name under which that field was added to the document at index time.
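Since the goal is cosine similarity, here is a small sketch of that step in plain Java. It assumes you already have, for each of the two documents, the term frequencies of the same field as a map from term to count. `CosineSketch`, `termFreqs` and `cosine` are hypothetical names, the whitespace tokenizer only stands in for a real analyzer, and the weights are raw term frequencies with no IDF weighting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: cosine similarity of two
// term-frequency maps that were built from the same field of two documents.
final class CosineSketch {
    // Counts how often each whitespace-separated term occurs in one field value.
    static Map<String, Integer> termFreqs(String fieldValue) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String term : fieldValue.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) freqs.merge(term, 1, Integer::sum);
        }
        return freqs;
    }

    // cos(theta) = (a . b) / (|a| * |b|); returns 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Euclidean length of a term-frequency vector.
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) sum += (double) f * f;
        return Math.sqrt(sum);
    }
}
```

Both maps must come from the same field: comparing the title of one document with the body of another would mix two different term spaces, which is another reason the field name is part of the lookup.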

javanna