
I have a Solr schema with a kind of versioning. Each ID contains the version number, so existing docs remain in the index as new versions are indexed. Sample contents:

id = foo1
name = foo
version = 1
data = x

id = foo2
name = foo
version = 2
data = y

id = bar1
name = bar
version = 1
data = x

There are two distinct search scenarios: search all versions, or search only the latest. The first is trivial, but how do I implement a search in the data field that covers only the latest version of each name? In the sample above I wish to search for "x" in the latest versions, and expect to hit only "bar1".

I was hoping for a solution using http://wiki.apache.org/solr/FieldCollapsing, but if I search for "x" with group.field=name Solr will group after search, giving me version 1 of the two names above. I would need it to work more like a filter query.
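To make the difference concrete, here is a minimal in-memory sketch in plain Python (no Solr involved) that runs both orderings over the sample documents above. The function names are made up for illustration, and the exact-match test on `data` is only a stand-in for a real Solr query.

```python
# Sample documents from the question.
DOCS = [
    {"id": "foo1", "name": "foo", "version": 1, "data": "x"},
    {"id": "foo2", "name": "foo", "version": 2, "data": "y"},
    {"id": "bar1", "name": "bar", "version": 1, "data": "x"},
]


def search_then_group(docs, term):
    """Match first, then keep the highest matching version per name.

    This mirrors what grouping after the search gives me.
    """
    groups = {}
    for doc in docs:
        if doc["data"] == term:
            best = groups.get(doc["name"])
            if best is None or doc["version"] > best["version"]:
                groups[doc["name"]] = doc
    return sorted(d["id"] for d in groups.values())


def latest_then_search(docs, term):
    """Keep only the latest version per name, then match.

    This is the filter-like behaviour I am after.
    """
    latest = {}
    for doc in docs:
        best = latest.get(doc["name"])
        if best is None or doc["version"] > best["version"]:
            latest[doc["name"]] = doc
    return sorted(d["id"] for d in latest.values() if d["data"] == term)
```

With the sample data, the first function returns both "bar1" and "foo1", while the second returns only "bar1".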

javanna
solsson

1 Answer


I don't think field collapsing would serve your purpose.

I can think of a couple of options:

  1. Give every version of a document the same unique id, so that adding the new version overwrites the old one and the index always holds only one version of each document.
  2. Maintain an extra field that marks a document's status as CURRENT, if that is possible. Only the latest version carries the value, so you need to reset it on all other versions of the document. You can then restrict a search to the latest documents with a filter query, and search through all versions by leaving the filter query out.
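A rough sketch of option 2 in Python, under a few assumptions: the flag is a boolean field named `current` (a hypothetical name, not part of the schema in the question), the incoming document is always the newest version of its name, and documents are plain dicts prepared on the source side before being sent to Solr. The only Solr parameters used are the standard `q` and `fq`.

```python
from urllib.parse import urlencode


def docs_to_submit(existing, new_doc):
    """Return the documents to (re)index when new_doc arrives.

    Previously current versions with the same name are resubmitted with
    the hypothetical `current` flag cleared; the new one gets it set.
    """
    updates = []
    for doc in existing:
        if doc["name"] == new_doc["name"] and doc.get("current"):
            updates.append({**doc, "current": False})
    updates.append({**new_doc, "current": True})
    return updates


def latest_only_params(term):
    """Query string that searches `data` but filters to current docs."""
    return urlencode({"q": "data:" + term, "fq": "current:true"})


def all_versions_params(term):
    """Same search without the filter query, covering every version."""
    return urlencode({"q": "data:" + term})
```

Note that the older version has to be resubmitted in full each time, which is the cost of this approach.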
Jayendra
  • #1 would make my first search scenario impossible. Regarding #2, is there a good way to mark the old document as not CURRENT when adding a new one? I think I'd have to reindex it. – solsson Sep 19 '11 at 05:19
  • #2 handling needs to be done from the database (or your source side), and would surely need reindexing all the versions of the documents again. – Jayendra Sep 19 '11 at 05:55
  • OK. It can be plan B, while hoping for better options. An alternative is to index immediately to two cores, one that overwrites (your option #1). Docs can be large though, so I'd want it to be a single submit. – solsson Sep 19 '11 at 06:12