java - MongoDB + Solr performances

Question

I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to a few hundred million), and I want to implement full-text search on some properties of those documents, so I guess Solr is the best way to do this.

What I want to know is how I should configure/execute everything so that it performs well. Right now, here's what I do (and I know it's not optimal):

1- When inserting an object in MongoDB, I then add it to Solr

SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();

2- When updating a property of the object, since Solr cannot update just one field, I first retrieve the object from MongoDB, then update the Solr index with all the properties of the object plus the new ones, doing something like:

StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();

3- When querying, I first query Solr, and then, once I have the resulting SolrDocumentList, I go through each document and:

- get the id of the document
- get the object with the same id from MongoDB, so that I can retrieve the properties from there

4- When deleting, well, I haven't done that part yet and I'm not really sure how to do it in Java.

So, does anybody have suggestions on how to handle each of the scenarios described here more efficiently? For example, how can I avoid taking 1 hour to rebuild the index when Solr holds a lot of documents and they are added one at a time? My requirement is that users may add one document at a time, many times, and I'd like them to be able to retrieve it right after.

How big is each document and the properties you want to index? - Justin Thomas, Aug 30 '11 at 14:52
@JustinThomas - Well, each document can have around 10 properties. Some of them can be long descriptions, and I'd like full-text search on the description and just exact matching on the other ones. Does that answer your question? - Guillaume, Aug 30 '11 at 16:33

jpountz · Accepted Answer · 2011-08-31T12:14:17.257

Your approach is actually good. Some popular frameworks like Compass are performing what you describe at a lower level in order to automatically mirror to the index changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).

In addition to what you describe, I would also regularly re-index all the data that lives in MongoDB in order to ensure that Solr and Mongo stay in sync. This will probably not take as long as you might think; it depends on the number of documents, the number of fields, the number of tokens per field and the performance of the analyzers. I often build indexes of 5 to 8 million documents (around 20 fields, but the text fields are short) in less than 15 minutes with complex analyzers. Just ensure your RAM buffer is not too small, and do not commit or optimize until all documents have been added.
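As a rough illustration of "do not commit until all documents have been added", here is a minimal sketch of a batched re-index loop. `BulkIndex` is a hypothetical stand-in for the Solr client so that the example runs without a server; with SolrJ, `SolrServer.add(Collection<SolrInputDocument>)` and `SolrServer.commit()` would play the same roles. The batch size and the use of plain maps as documents are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the Solr client (not a SolrJ type).
interface BulkIndex {
    void addAll(List<Map<String, Object>> docs);
    void commit();
}

final class BulkReindexSketch {
    // Streams every source document to the index in batches and commits
    // exactly once at the end. Returns the number of documents sent.
    static long reindex(Iterator<Map<String, Object>> source, BulkIndex index, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Map<String, Object>> batch = new ArrayList<>(batchSize);
        long sent = 0;
        while (source.hasNext()) {
            batch.add(source.next());
            if (batch.size() == batchSize) {
                index.addAll(new ArrayList<>(batch)); // copy: the client may keep the list
                sent += batch.size();
                batch.clear();
            }
        }
        if (!batch.isEmpty()) { // flush the last, partially filled batch
            index.addAll(new ArrayList<>(batch));
            sent += batch.size();
        }
        index.commit(); // single commit after all documents were added
        return sent;
    }
}
```

The source iterator would typically wrap a MongoDB cursor, so that the whole collection never has to fit in memory.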

Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters most to you, you could change the value of mergeFactor in solrconfig.xml (high values improve write performance whereas low values improve read performance; 10 is a good value to start with).

You seem to be afraid of the index build time. However, since Lucene index storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). The warm-up time will increase, though, so you should ensure that:

- there are typical (especially for sorts, in order to load the field caches) but not too complex queries in the firstSearcher and newSearcher parameters in your solrconfig.xml config file,
- useColdSearcher is set to
  - false in order to have good search performance, or
  - true if you want changes to the index to be taken into account faster, at the price of slower searches.

Moreover, if it is acceptable for the data to become searchable only X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of the UpdateHandler. This way Solr will have to commit less often.
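To see why commitWithin reduces the number of commits, here is a toy model of the idea. It is not Solr's implementation: `CommitWithinModel` and all of its methods are hypothetical, and the clock is passed in by the caller so that the behaviour is deterministic. In SolrJ itself, I believe the setting is exposed as `UpdateRequest.setCommitWithin(int)`, but check the Javadoc of your version.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of commitWithin semantics; all times are in milliseconds.
final class CommitWithinModel {
    private final List<String> pending = new ArrayList<>();
    private final List<String> searchable = new ArrayList<>();
    private long deadline = Long.MAX_VALUE; // earliest time by which a commit is due
    private int commits = 0;

    // Buffers a document and asks for it to be searchable within commitWithinMs.
    void add(String id, long nowMs, long commitWithinMs) {
        tick(nowMs);
        pending.add(id);
        deadline = Math.min(deadline, nowMs + commitWithinMs);
    }

    // Advances the clock; commits all pending documents once the deadline is reached.
    void tick(long nowMs) {
        if (!pending.isEmpty() && nowMs >= deadline) {
            searchable.addAll(pending);
            pending.clear();
            deadline = Long.MAX_VALUE;
            commits++;
        }
    }

    boolean isSearchable(String id) {
        return searchable.contains(id);
    }

    int commitCount() {
        return commits;
    }
}
```

In this model, a burst of adds that all arrive before the first deadline shares one commit, whereas calling commit() after every add would commit once per document.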

For more information about Solr performance factors, see http://wiki.apache.org/solr/SolrPerformanceFactors

To delete documents, you can either delete by document ID (as defined in schema.xml) or by query: http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html
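For the deletion case from the question, SolrJ's SolrServer exposes `deleteById(String)` and `deleteByQuery(String)`, as listed in the Javadoc linked above. A minimal sketch of mirroring a delete across both stores could look like this. `PrimaryStore` and `SearchIndex` are hypothetical interfaces introduced here so that the example runs without MongoDB or Solr; only the method name `deleteById` is borrowed from SolrJ.

```java
// Hypothetical minimal views of the two stores (not MongoDB or SolrJ types).
interface PrimaryStore {
    boolean remove(String id); // true if a document with this id existed
}

interface SearchIndex {
    void deleteById(String id); // assumed to be a no-op when the id is not indexed
}

final class MirroredDelete {
    // Deletes from the primary store first, then from the index.
    // Returns whether the document existed in the primary store.
    static boolean delete(String id, PrimaryStore store, SearchIndex index) {
        boolean existed = store.remove(id);
        index.deleteById(id); // always issued, so a stale index entry is cleaned up too
        return existed;
    }
}
```

No commit is issued here on purpose: as with adds, the deletion becomes visible at the next commit, so deletes can share a commit with other updates. Deleting from the primary store first means that a failure in between leaves at worst a stale index entry, which the query path can skip when the MongoDB lookup finds nothing.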

Good point on the `deleteById`, I actually did not see it (didn't even try to, I must say; I assumed there was something more complicated). Since you seem to know a lot about this, a few more questions if you don't mind: 1. How much is a good RAM buffer? 2. I didn't change the firstSearcher and newSearcher from the example solrconfig.xml file, are they good? 3. Finally, I have one instance of Solr running under Tomcat, with 5 cores in it. Does it change anything regarding performance to have more than one instance of Solr running? Thanks for your help - Guillaume, Sep 03 '11 at 01:27
1. You need to perform some benchmarks to find the best buffer size. I recommend you start with 32M and double the amount of memory available for the RAM buffer at every iteration; stop when increasing the RAM buffer size does not yield any significant improvement. - jpountz, Sep 05 '11 at 08:27
2. They are not: loading field caches (required for sorts and function queries, among others) takes time with Solr. As a consequence, the first query that uses field caches on a fresh index will have a performance penalty, so you need to put queries that will load these field caches in newSearcher and firstSearcher (just put a query which sorts on the same fields as your application will). - jpountz, Sep 05 '11 at 08:31
3. I think it is better to have only one instance running: some memory will be shared between the cores, so the global amount of memory required will be lower, leaving more memory for the operating system's I/O cache, which is a very important performance factor for Solr: http://java.dzone.com/news/os%E2%80%99s-cache-does-matter-query - jpountz, Sep 05 '11 at 08:37
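The buffer-size search described in point 1 (start at 32M, double, stop when the gain is no longer significant) can be sketched as follows. The benchmark itself is passed in as a function from buffer size in MB to indexing throughput, which you would have to measure on your own data; `RamBufferTuning`, `findBufferSizeMb`, the upper bound `maxMb` and the `minGain` threshold are all hypothetical names and parameters chosen for this example.

```java
import java.util.function.IntToDoubleFunction;

final class RamBufferTuning {
    // Starts at startMb and doubles the buffer size until throughput improves by
    // less than minGain (0.05 means 5%) or maxMb would be exceeded.
    // Returns the last size that still gave a significant improvement.
    static int findBufferSizeMb(IntToDoubleFunction docsPerSecond,
                                int startMb, int maxMb, double minGain) {
        if (startMb <= 0 || maxMb < startMb) {
            throw new IllegalArgumentException("need 0 < startMb <= maxMb");
        }
        int best = startMb;
        double bestRate = docsPerSecond.applyAsDouble(startMb);
        // long loop variable so that doubling cannot overflow int
        for (long mb = 2L * startMb; mb <= maxMb; mb *= 2) {
            double rate = docsPerSecond.applyAsDouble((int) mb);
            if (rate < bestRate * (1.0 + minGain)) {
                break; // no significant improvement: stop here
            }
            best = (int) mb;
            bestRate = rate;
        }
        return best;
    }
}
```

Each call to the benchmark function should index the same sample of documents into a fresh index, otherwise the measurements are not comparable.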

score 1 · Answer 2 · answered Sep 02 '11 at 14:10

You can also wait for more documents and index them only every X minutes. (Of course, this highly depends on your application and requirements.)
If your documents are small and you don't need all the data (which is stored in MongoDB), you can put only the fields you need in the Solr document, storing them but not indexing them:

<field name="nameOfYourField" type="stringOrAnyTypeYouUse" indexed="false" stored="true"/>
