Solr indexing of a large data set

Question

I have content that is about 50 TB in size, made up of about 250 million documents. The daily increment is not very large: maybe about 10,000 documents of varying sizes, totaling under 50 MB. The current indexing effort is taking far too long and is estimated to need 100+ days to complete!
So, is this really that large a data set? To me, 50 TB of content (in this day and age) is not very large. Do you have content of this size? If you do, how did you improve the time taken for one-time indexing? Also, how did you improve the time taken by real-time indexing?
If you can answer, great. If you can point me in the right direction, I'd appreciate that as well.

Thanks in advance.
rd

Check this: http://stackoverflow.com/a/31935578/2254048. Also disable softCommit for bulk indexing if it is on, and read https://wiki.apache.org/solr/SolrPerformanceFactors. — YoungHobbit, Sep 25 '15 at 17:45
Numbers in themselves are fairly meaningless with Solr: simple CSV imports can handle 30K docs/second, while sufficiently complex Tika processing can mean 1 doc/minute. If YoungHobbit's suggestions do not help, then please describe in more detail what data you are handling and how you add it to Solr. — Toke Eskildsen, Sep 26 '15 at 14:13

score 1 · Answer 1 · answered Sep 25 '15 at 18:29

There are a number of factors to consider.

Start with the client you use to index. Which client are you using: SolrJ, a framework that listens to a database (such as Oracle or HBase), or the REST API? This can make a difference: Solr itself handles indexing load well, but the client framework and the data preparation on the client side also need to be optimized. For example, if you use the HBase Indexer (which reads from HBase tables and writes to Solr), you can expect a few million documents to be indexed in an hour or so. At that rate, 250 million documents should not take long to complete.
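To put those rates side by side, here is a rough back-of-the-envelope calculation. The 250 million documents and the 100-day estimate come from the question; the figure of 2 million documents per hour is only an assumed stand-in for "a few million per hour", not a measured number.

```python
SECONDS_PER_DAY = 86_400


def docs_per_second(total_docs, days):
    """Average throughput implied by indexing total_docs in the given number of days."""
    return total_docs / (days * SECONDS_PER_DAY)


def hours_to_index(total_docs, docs_per_hour):
    """Hours needed to index total_docs at a steady hourly rate."""
    return total_docs / docs_per_hour


TOTAL_DOCS = 250_000_000           # from the question
ESTIMATED_DAYS = 100               # from the question ("100+ days")
ASSUMED_DOCS_PER_HOUR = 2_000_000  # assumption: stand-in for "a few million per hour"

current_rate = docs_per_second(TOTAL_DOCS, ESTIMATED_DAYS)
hours_needed = hours_to_index(TOTAL_DOCS, ASSUMED_DOCS_PER_HOUR)

print(f"Current implied rate: {current_rate:.1f} docs/second")
print(f"At the assumed rate: {hours_needed:.0f} hours ({hours_needed / 24:.1f} days)")
```

A 100-day run works out to only about 29 documents per second, which suggests the bottleneck is more likely in the client or in document preparation than in the raw size of the data set.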
After the client comes the Solr environment itself. How many fields are you indexing per document? Do you have stored fields, or field types that carry other overheads?
Configuration parameters such as autoCommit (based on number of records or RAM size), softCommit (as mentioned in the comment above), the number of parallel threads used to index data, and the hardware are some of the points to consider.
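As a rough illustration of batching and parallel threads on the client side, here is a minimal sketch that posts JSON batches to Solr's update handler using only the Python standard library. The URL, the collection name `mycollection`, the batch size, and the worker count are all placeholder assumptions, and the helper names are made up for this example. It sends no commit per batch, on the assumption that commits are handled by autoCommit on the server or by one explicit commit at the end.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical endpoint: adjust host, port, and collection name to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def build_request(batch, url=SOLR_UPDATE_URL):
    """Build a POST request carrying one batch as a JSON array of documents."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_commit_request(url=SOLR_UPDATE_URL):
    """Build a single explicit commit, meant to be sent once after the bulk load."""
    return build_request({"commit": {}}, url)


def send(request_body):
    """Send one batch to Solr and return the HTTP status code."""
    with urllib.request.urlopen(build_request(request_body)) as resp:
        return resp.status


def bulk_index(docs, batch_size=1000, workers=4, sender=send):
    """Send batches from several threads, without committing after each batch.

    Note: Executor.map submits every batch up front, so for very large
    inputs a bounded queue would be needed to limit memory use.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(sender, batches(docs, batch_size)))
```

Calling `bulk_index(my_docs)` against a running Solr instance returns one HTTP status per batch. The right batch size and thread count depend on document size and hardware, so treat the defaults here as starting points to measure against, not as recommendations.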

You can find a comprehensive checklist here and verify each item. Happy designing!
