
I am trying to index documents with SolrJ. I am using Solr 4.5 and I have a huge number of files to index. What are the ways to index each file while avoiding a performance bottleneck?

user3161879
  • How long are we talking about? You only have to index these files once. – Bartlomiej Lewandowski Jan 07 '14 at 21:45
  • @Bartlomiej Lewandowski: It indexes 35,000 records in 1 hour, so with 700,000 records in total I have to wait a long time for the rest. Yes, I only have to index these files once, but I am calling a Solr update request for each file. – user3161879 Jan 07 '14 at 22:13

2 Answers


Issuing an update request for each document is slow with Solr.

You are much better off adding all the documents and then doing a single commit with the update. Taken from the Solr wiki:

import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

// Build up the full set of documents instead of sending them one by one.
Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
docs.add( doc1 );
docs.add( doc2 );

// Send the whole batch in a single request and commit once at the end.
UpdateRequest req = new UpdateRequest();
req.setAction( UpdateRequest.ACTION.COMMIT, false, false );
req.add( docs );
UpdateResponse rsp = req.process( server );
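
If the documents are produced one at a time (as the comments below discuss), a hedged variant is to accumulate them and flush in fixed-size chunks so the whole set never has to sit in memory. BATCH_SIZE and parsedDocuments below are made-up names, chosen only for illustration:

// Sketch: flush every BATCH_SIZE documents to bound client-side memory.
final int BATCH_SIZE = 1000;                    // illustrative value, tune as needed
List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

for (SolrInputDocument doc : parsedDocuments) { // parsedDocuments: your own source
    batch.add(doc);
    if (batch.size() >= BATCH_SIZE) {
        server.add(batch);                      // one request per chunk
        batch.clear();
    }
}
if (!batch.isEmpty()) {
    server.add(batch);                          // flush the remainder
}
server.commit();                                // single commit at the very end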
Bartlomiej Lewandowski
  • I am getting each document to be indexed by parsing another file line by line, so I don't have a collection of documents as such; I have one file at a time. Shall I add documents to this collection every time I loop, and then call this update request for that collection? – user3161879 Jan 07 '14 at 22:38
  • @user3161879 Yes, add them to the collection, and when you are done, make the update request with the collection. – Bartlomiej Lewandowski Jan 07 '14 at 22:57
  • I changed the code and I am now passing the document collection to the Solr server, but even though every document inside has its id explicitly set, Solr is throwing an exception: Document is missing mandatory uniqueKey field: id. What extra needs to be done? – user3161879 Jan 09 '14 at 18:05
  • @user3161879 You can read about generating ids [here](http://wiki.apache.org/solr/UniqueKey). – Bartlomiej Lewandowski Jan 09 '14 at 18:16
  • But the id is already there inside each file. I was able to add to Solr before creating a collection, so for a collection I am not sure whether it needs an id. – user3161879 Jan 09 '14 at 21:38
  • You might have misunderstood. The reason you need an id is that you have a mandatory field id in your schema. When you add documents by your previous method, Tika generates the id; without that, you have to set it yourself. – Bartlomiej Lewandowski Jan 10 '14 at 00:59
  • @Bartlomiej Lewandowski: I am sorry, I am not able to understand this id concept. Earlier in my code I was explicitly setting the id for each document I passed to Solr. But then we set the id in each document itself, and I was not passing any id to Solr while indexing these documents. So why do we need an id now? – user3161879 Jan 10 '14 at 22:04
  • @user3161879 Could you show a sample document that you are trying to index? – Bartlomiej Lewandowski Jan 11 '14 at 21:22

The first thing to check is the server-side log: look for messages about commits. It's possible you are doing a hard commit after parsing each file, and that's expensive. You could look into soft commits or the commitWithin parameter to have files show up slightly later.
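
For example, SolrJ's UpdateRequest exposes setCommitWithin, which asks the server to make the documents visible within a time window instead of forcing a hard commit per file. A minimal sketch (the 10-second window is an arbitrary choice):

// Sketch: no explicit commit per file; let the server schedule visibility.
UpdateRequest req = new UpdateRequest();
req.add(doc);                 // doc: a SolrInputDocument built elsewhere
req.setCommitWithin(10000);   // milliseconds; tune to your freshness needs
req.process(server);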

Secondly, you seem to be sending a request that makes Solr fetch your file and run a Tika extract on it. This probably spins up Tika inside Solr every time, and you will not be able to batch that the way the other answer suggests.

But you could run Tika locally in your client, initialize it once, and keep it around. That gives you more flexibility in how you construct your SolrInputDocument, which you can then batch.
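
A minimal sketch of that approach, using the Tika facade class; the field names "id" and "content", and the filesToIndex list, are assumptions and would have to match your own schema and code (exception handling omitted):

import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

// Initialize Tika once, outside the loop, and reuse it for every file.
Tika tika = new Tika();

Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
for (File file : filesToIndex) {              // filesToIndex: your own file list
    String body = tika.parseToString(file);   // extract the text locally
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", file.getName());       // must satisfy the schema's uniqueKey
    doc.addField("content", body);            // field name depends on your schema
    docs.add(doc);
}

// One batched request instead of one HTTP call per file.
UpdateRequest req = new UpdateRequest();
req.setAction(UpdateRequest.ACTION.COMMIT, false, false);
req.add(docs);
req.process(server);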

Alexandre Rafalovitch
  • How do we run the Tika extract externally? I thought SolrJ takes care of that internally. – user3161879 Jan 08 '14 at 14:36
  • If I understand your code correctly, you are sending this request to the extract handler, which means the Solr *server* is running Tika. Instead, you can instantiate Tika within your own Java process and run it yourself. You will lose some of the field-mapping functionality Solr implements, but I suspect you are not using it yet. – Alexandre Rafalovitch Jan 09 '14 at 00:22
  • Yes, I will try your suggestion and check. One more thing: when I try to index multiple documents one by one, it indexes 100,000 records successfully, and then for the rest of the files it starts saying Server at http://servername/solr/ returned non ok status:500, message:Internal Server Error. What could be the reason? – user3161879 Jan 09 '14 at 14:44
  • If you look at the code above, I create one instance of UpdateRequest up front, then pass that instance to the Solr indexing for each file. Do you mean the same thing? – user3161879 Jan 09 '14 at 16:01
  • Look at the server logs; they tell you the cause of the error. Maybe you are not doing commits and the server is running out of memory. You can configure periodic soft and hard commits in solrconfig.xml. – Alexandre Rafalovitch Jan 10 '14 at 00:30
  • @Alexandre Rafalovitch I checked solrconfig.xml; it is set to commit every 10 minutes there. – user3161879 Jan 10 '14 at 15:40
  • Look at the logs. They will tell you what the error is. They will also tell you when your commits are happening. Just read carefully. – Alexandre Rafalovitch Jan 12 '14 at 10:06