Is it better to use

  • many indexes (e.g. one per user, since the application allows that) in Lucene
  • or just one, holding every document

... if you think about:

  • performance
  • disk space
  • health

I am using elasticsearch, therefore I am using Lucene.

maerzbow
  • Are the users searching the same set of documents or a different set? If different sets, what is the nature of the documents? – Prescott Dec 22 '11 at 20:42
  • In my example every user should only see his/her documents, so only one index will be used when doing a search. – maerzbow Dec 22 '11 at 20:47
  • So - separate indexes have the pros of 1. making it easy to wipe away a user's data, 2. little concern about users seeing documents that aren't theirs, which simplifies your coding, 3. small(er), faster searches per user. The cons are maybe the initial startup cost of opening the index and more index files to deal with, which aren't really big cons in my book. – Prescott Dec 22 '11 at 20:52
  • Two other concerns to be aware of. Going the index-per-user route, you increase disk usage, as some data will be replicated per user. Going the single-index route, the search scoring for one user can be skewed by the contents of documents that belong to a different user. For example, if user A uses a rare word often in his documents but user B uses it only once, that word would not score as strongly for user B as it would if the index contained only B's documents. – rfeak Dec 23 '11 at 03:43
  • A similar question asked about Solr http://stackoverflow.com/questions/8592153/storing-multiple-sets-of-documents-on-single-or-multiple-cores/8593953#8593953 – Jesvin Jose Dec 23 '11 at 05:04

1 Answer

In Elasticsearch, based on your information, I would use one index. My understanding is that users are only searching their own documents, and the documents seem to be relatively similar.

Performance - When searching, you can use a Filtered Query to restrict results to the documents belonging to the user. The user id filter is cacheable and fast.
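
As a rough sketch of what such a per-user search could look like, the Python below only builds request bodies and does not talk to a cluster. The field names `user_id` and `body` are assumptions for illustration, not something from the question. The first function shows the `filtered` query form used by older releases; that query type was removed in Elasticsearch 5.0, where a `bool` query with a `filter` clause plays the same role, as in the second function.

```python
import json


def filtered_search_body(user_id, text):
    """Legacy `filtered` query form (older Elasticsearch releases only)."""
    return {
        "query": {
            "filtered": {
                # Scored full-text part; `body` is a hypothetical field name.
                "query": {"match": {"body": text}},
                # Non-scoring restriction to one user; `user_id` is hypothetical.
                "filter": {"term": {"user_id": user_id}},
            }
        }
    }


def bool_filter_search_body(user_id, text):
    """Equivalent shape for Elasticsearch 5.0 and later."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"body": text}}],
                # Clauses under `filter` do not affect scoring.
                "filter": [{"term": {"user_id": user_id}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(bool_filter_search_body("user-42", "lucene"), indent=2))
```

Either body would be sent to the search endpoint of the single shared index; only the `user_id` value changes between users.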

Scalability - In Elasticsearch, you control sharding and replication at the index level. Elasticsearch can handle large numbers of indexes, but I think it is more valuable to configure appropriate shard and replica counts once, for a single index.
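
To make the index-level settings concrete, here is a small sketch. `number_of_shards` and `number_of_replicas` are the real index setting names; the user count and the shard and replica numbers are made-up values chosen only to show the arithmetic. Since every shard is itself a Lucene index, the total number of shard copies is a reasonable first proxy for the overhead of each layout.

```python
def index_settings(shards, replicas):
    """Settings body that could be supplied when creating an index."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(index_count, shards, replicas):
    """Primary shards plus replica copies across all indexes."""
    return index_count * shards * (1 + replicas)


if __name__ == "__main__":
    # Hypothetical layout A: one index per user, 1,000 users, 1 shard each.
    per_user = total_shard_copies(index_count=1000, shards=1, replicas=1)
    # Hypothetical layout B: one shared index with 5 shards.
    shared = total_shard_copies(index_count=1, shards=5, replicas=1)
    print("index per user:", per_user, "shard copies")
    print("single index:  ", shared, "shard copies")
```

The real trade-off also depends on document volume and query load, which this count does not capture.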

In a single index, you can still easily wipe away a user's data (see delete by query), and there should be little concern about seeing others' data unless you write your queries wrong. A filtered query that restricts results to those associated with a user id is very simple, similar in complexity to searching a different index per user.
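
A minimal sketch of wiping one user's documents from a shared index follows. It only assembles the request and does not send it. The index name `documents` and the `user_id` field are hypothetical. The `_delete_by_query` endpoint shown is the one in Elasticsearch 5.0 and later; older releases exposed delete by query differently, so treat the path as an assumption to check against your version.

```python
def delete_user_request(index, user_id):
    """Return (method, path, body) for deleting one user's documents."""
    body = {
        # Same term restriction as the per-user search filter.
        "query": {"term": {"user_id": user_id}}
    }
    return "POST", "/{}/_delete_by_query".format(index), body


if __name__ == "__main__":
    method, path, body = delete_user_request("documents", "user-42")
    print(method, path)
    print(body)
```

Because the delete uses the same `user_id` restriction as the search, a mistake in one is likely to show up in the other, which makes it worth keeping both in a single helper module.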

Your exact needs might fit a different approach better. Based on what I have so far, though, I would choose one index.

Andy