
Basically, we want to be able to search within various subsets of a large document repository. We are thinking about using a multivalued field to store, for each document, which subsets it is currently in, and filtering on this field when searching. The problem is that the subsets are constantly changing, so we would have to frequently add new subsets to and remove old subsets from this field.
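
To make the idea concrete, here is a minimal sketch of what we are considering, using Solr's HTTP API from Python. The core name `articles`, the `subsets` field, and the subset labels are all placeholders, and the exact update endpoint can vary with the Solr version:

```python
import requests

SOLR = "http://localhost:8983/solr/articles"  # hypothetical core name

# Index a document whose multivalued "subsets" field lists every subset
# it currently belongs to.
doc = {
    "id": "doc-42",
    "title": "Some article",
    "subsets": ["tag:python", "favorites:user123"],  # multivalued field
}
requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()

# Search only within one subset by filtering on that field.
params = {
    "q": "title:solr",
    "fq": 'subsets:"favorites:user123"',  # restrict results to one subset
    "wt": "json",
}
results = requests.get(f"{SOLR}/select", params=params).json()
print(results["response"]["numFound"])
```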

I have read that when updating a field in a Solr document, I have to update the whole document, and that the update is performed by deleting the old copy and adding a new one. So frequent updates will leave a lot of deleted copies behind, bloating the internal lookup tables and degrading performance.
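
Concretely, as far as I understand it, every membership change amounts to re-sending the entire document, roughly like the sketch below (hypothetical names again; it assumes all fields are stored so they can be read back and re-indexed):

```python
import requests

SOLR = "http://localhost:8983/solr/articles"  # hypothetical core name

def move_doc_to_subset(doc_id, add_subset, remove_subset):
    """Re-index the whole document just to change its subset membership.

    Internally Solr/Lucene deletes the old copy and writes a new one,
    so each call leaves a deleted document behind until segments merge.
    """
    # Fetch the current stored fields of the document.
    resp = requests.get(f"{SOLR}/select",
                        params={"q": f'id:"{doc_id}"', "wt": "json"}).json()
    doc = resp["response"]["docs"][0]
    doc.pop("_version_", None)  # drop internal fields, if present

    # Change the multivalued field ...
    subsets = set(doc.get("subsets", []))
    subsets.discard(remove_subset)
    subsets.add(add_subset)
    doc["subsets"] = sorted(subsets)

    # ... and send the entire document back.
    requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()
```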

My question is: how serious is this degradation? And is there a better way to approach this problem? It should be a common problem after all; what immediately comes to mind are searching for articles with a specific tag and searching within a user's favorite articles (although our own use case is more complex).

I have looked at ExternalFileField a bit, but it seems that it doesn't support multivalued fields (I hope I'm wrong), and there are too many different combinations of subsets to encode each combination as a single integer (i.e. to turn the multivalued field into a single-valued one).
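
For what it's worth, the single-value encoding we ruled out would look roughly like the hypothetical bitmask below: each subset takes one bit, so the number of distinct membership combinations grows as 2^N, which does not fit into the single value per document that ExternalFileField provides:

```python
# Hypothetical encoding of subset membership as a single integer (bitmask).
# Each subset gets one bit, so N subsets need N bits and there are 2**N
# possible membership combinations -- far too many to enumerate, or to fit
# into the single per-document value an ExternalFileField stores.
SUBSET_BITS = {"tag:python": 0, "favorites:user123": 1, "editors-picks": 2}

def encode(subsets):
    value = 0
    for name in subsets:
        value |= 1 << SUBSET_BITS[name]
    return value

def is_member(value, subset):
    return bool(value & (1 << SUBSET_BITS[subset]))

print(encode(["tag:python", "editors-picks"]))  # -> 5
print(is_member(5, "favorites:user123"))        # -> False
```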

Gary Chang
  • How many documents are you indexing? And how many updates per hour? Solr is rather fast, so for 10k documents you can simply use the `optimize` command (which regenerates a new, efficient index) periodically. I can vouch that it takes a few seconds and an imperceptible amount of disk activity at 10k-document scale. – Jesvin Jose Mar 05 '12 at 05:26
  • Unfortunately, the scale is closer to 10m, and the update frequency can also be high, as the updates are user-generated. – Gary Chang Mar 05 '12 at 06:08
  • Could you talk more about your "subsets" and how queries relate to them? You could set things up so that the document metadata changes less often, while frequent changes are handled at query time; I discussed one such question at http://stackoverflow.com/questions/9222835/solr-permissions-filtering-results-depending-on-access-rights/9224624#9224624. If you can find no other way to reduce the write rate, Elasticsearch (also Lucene-based, not related to Amazon) or Solandra (seems exotic) may handle your needs. – Jesvin Jose Mar 05 '12 at 07:55
  • Have you tried it? A lot of things which seem theoretically problematic for performance don't have a real impact. And I doubt using the filesystem via `ExternalFileField` will be any faster. – Xodarap Mar 06 '12 at 18:17

0 Answers