Given a query and a term, how can I calculate the average position of the term within each document matching the query and return it? I am looking for the fastest solution in terms of performance, and I am willing to extend Solr's functionality.

Following that, I need to calculate the average position of the term across all documents matching the query. For that, I do not need to return the documents themselves to the client, just the average term position.

Thanks, Saar

Saar
  • please define "average position of a term" with an example. – phanin Oct 20 '13 at 03:39
  • Assume we have two documents, "hello my name is phani" and "hello I was called phani by my parents". Then the average position of the term "my" in this document set is (2+7)/2 = 4.5. – Saar Oct 21 '13 at 19:19
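
The definition in the comment above can be checked with a minimal sketch. It assumes plain whitespace tokenization and 1-based positions, as in the example; a real Solr analyzer may tokenize differently, and Lucene typically reports zero-based positions. The helper names `term_positions` and `mean` are made up for illustration.

```python
def term_positions(text, term):
    """Return the 1-based positions of `term` in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [i for i, tok in enumerate(tokens, start=1) if tok == term.lower()]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


docs = [
    "hello my name is phani",
    "hello I was called phani by my parents",
]

# Average position of "my" inside each document, then across the document set.
per_doc = [mean(term_positions(d, "my")) for d in docs]
average_position = mean(per_doc)

print(per_doc, average_position)  # prints: [2.0, 7.0] 4.5
```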

2 Answers

One solution is the following. It takes quite a lot of coding, and I'm not aware of a shortcut, because you need to traverse term positions within documents. There is no built-in functionality to do so via functions, though you might also think about using payloads somehow.

  1. Create your own query type, extending the basic TermQuery.
  2. For TermQuery, the scoring logic boils down to traversing the TermsEnum object created for your term. You can use DocsAndPositionsEnum to enumerate all positions of that term in each document.
  3. I assume you don't care about the Lucene similarity calculation (do you?). In that case you can simply return the average position in a given document as its 'score'.
  4. The tricky part is to return average information across your set without returning the documents themselves. I would try to use the StatsComponent, which returns basic statistics for a certain field in the result set. I don't know if it can work with a 'score' field, or any other calculated field. If it doesn't, try altering the QueryComponent to calculate the average and set it as a result instead of the documents. If you expect to run this thing within a cluster (distributed search), you would also have to override the distributed query behavior so that you calculate the average from all of the shards.
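
The steps above can be sketched outside of Lucene to show the arithmetic involved. This is a plain Python illustration, not Lucene or Solr code: `ShardPartial`, `collect_shard` and `merge` are hypothetical names, and the postings are hand-written dictionaries. It assumes the per-document average is used as the 'score' (step 3) and that each shard reports a sum and a count, so that the coordinator does not average per-shard averages.

```python
from dataclasses import dataclass


@dataclass
class ShardPartial:
    """Hypothetical per-shard result: sum of per-document scores and document count."""
    score_sum: float = 0.0
    doc_count: int = 0


def doc_score(positions):
    """Step 3: the 'score' of a document is the mean position of the term in it."""
    return sum(positions) / len(positions)


def collect_shard(postings):
    """Step 4, shard side. `postings` maps doc id -> term positions for matching docs."""
    partial = ShardPartial()
    for positions in postings.values():
        if positions:  # ignore documents in which the term does not occur
            partial.score_sum += doc_score(positions)
            partial.doc_count += 1
    return partial


def merge(partials):
    """Step 4, coordinator side: combine sums and counts, then divide once."""
    total = sum(p.score_sum for p in partials)
    count = sum(p.doc_count for p in partials)
    return total / count if count else None


shard_a = {"doc1": [2], "doc2": [7]}
shard_b = {"doc3": [1, 5]}
result = merge([collect_shard(shard_a), collect_shard(shard_b)])
print(result)  # prints: 4.0
```

Averaging the two shard averages instead (4.5 and 3.0) would give 3.75, which is why the sketch ships sums and counts rather than ready-made averages.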

Another option may be to alter the indexing logic and calculate those averages at the analysis stage. If you manage to do so (storing the result in a payload), you can fetch this information much faster at query time, but it means developing a sophisticated analysis filter.
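
A rough sketch of that trade-off, with a plain dictionary standing in for the payload storage. Nothing here is Lucene API; `analyze` and `average_position` are hypothetical helpers, and whitespace tokenization with 1-based positions is assumed.

```python
from collections import defaultdict


def analyze(doc_id, text, index):
    """Index-time step: store each term's mean 1-based position in this document."""
    positions = defaultdict(list)
    for pos, token in enumerate(text.lower().split(), start=1):
        positions[token].append(pos)
    for term, plist in positions.items():
        index[term][doc_id] = sum(plist) / len(plist)


def average_position(index, term, matching_doc_ids):
    """Query-time step: average the precomputed values over the matching documents."""
    stored = index.get(term, {})
    values = [stored[d] for d in matching_doc_ids if d in stored]
    return sum(values) / len(values) if values else None


index = defaultdict(dict)
analyze("doc1", "hello my name is phani", index)
analyze("doc2", "hello I was called phani by my parents", index)

avg = average_position(index, "my", ["doc1", "doc2"])
print(avg)  # prints: 4.5
```

The query-time work shrinks to one lookup per matching document, at the cost of doing the position arithmetic for every term while indexing.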

lexk

If I understand you correctly, you would like to compute the arithmetic mean of all positions of a term in the document set returned for a particular query.

Here's what I could come up with.

First of all, you must enable positional information while indexing to extract any positional info from the index.

Take a look at this component: The Term Vector Component

  • Supply your query
  • Supply tv.positions=true.
  • Supply rows=veryBigNumber, as described for the Solr rows parameter, so that every matching document is included.

The response will contain what you need to compute the arithmetic mean.

Please do not forget to include the term you are looking for in the query. For example: q=(field1:someExQueryIfNeeded AND field2:targetTerm)
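
Put together, the request could be built roughly as below. This is only a sketch: the host, the core name `mycore` and the `/tvrh` handler path are assumptions about your setup, and `positions_by_doc` is a hypothetical structure you would fill by parsing the term vector section of the response, whose exact layout depends on the Solr version and response writer.

```python
from urllib.parse import urlencode


def build_tv_request(base_url, query, field, rows):
    """Build a Term Vector Component request URL that asks for term positions."""
    params = {
        "q": query,
        "tv": "true",            # enable the Term Vector Component
        "tv.positions": "true",  # include positions for each term
        "tv.fl": field,          # only return term vectors for this field
        "fl": "id",              # keep the regular document list minimal
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


def mean_of_all_positions(positions_by_doc):
    """Arithmetic mean over every position of the term in every returned document."""
    all_positions = [p for plist in positions_by_doc.values() for p in plist]
    return sum(all_positions) / len(all_positions) if all_positions else None


url = build_tv_request(
    "http://localhost:8983/solr/mycore/tvrh",  # assumed host, core and handler
    "field1:someExQueryIfNeeded AND field2:targetTerm",
    "field2",
    1000,
)
print(url)

# Hypothetical positions extracted from the response for the target term.
print(mean_of_all_positions({"doc1": [2], "doc2": [7]}))  # prints: 4.5
```

Note that this pools all positions before dividing. If you want the mean of per-document averages instead, average within each document first; the two differ as soon as a document contains the term more than once.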

Make sure that you retrieve only what you need. If the response turns out to contain a lot of noise, you can always customize this component as a Solr plugin and return only the information you need.

phanin