
I'm using Solr 7.2, and I'm trying to index 133k documents with the DataImportHandler.

The problem is that indexing takes a long time (4 hours), and it gets much slower after the first 50k documents. After a lengthy analysis, I found that the indexed multivalued fields are responsible for the slowdown: when I set the multivalued fields to `indexed="false"`, indexing is very fast (a couple of minutes).

Is there a way to speed up indexing by changing the configuration, or by any other means?

    <fieldType name="text_fr_lemmatized" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-select.txt" />
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-apostrophe.txt" />
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ponctuation.txt" />
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
        <filter class="solr.HunspellStemFilterFactory" dictionary="fr_FR.dic" affix="fr_FR.aff" ignoreCase="true" strictAffixParsing="true" />
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

beji dhia
  • What are your field types for the multivalued fields, and what's the analyzer/tokenizer/filter setup for them? – Jayce444 Jun 03 '18 at 13:41
  • I just updated my post with details of the analyzer/tokenizer/filter setup. The field types are all text. – beji dhia Jun 03 '18 at 15:53
  • If you're setting `indexed="false"`, you're not really doing anything with the field - so it's not really weird that things are going very fast when you're not processing anything for the field. Does it go faster if you commit more often? How about indexing 50k documents at a time from the same database, instead of all 133k in the same batch? I.e. does it take time in the DB layer or in the Solr part? – MatsLindh Jun 03 '18 at 18:47
  • When I set `indexed="false"`, everything goes really fast, but then I can't search on those fields later. I commit once, at the end of the indexing run, because I assumed that would reduce indexing time. I've tried indexing the data in chunks of 50k documents each, but the problem is still the same: the first chunk is indexed in 20 minutes, and each of the others takes 2 hours. – beji dhia Jun 03 '18 at 19:59
  • I figured out that is decreasing performance. I don't know why. – beji dhia Jun 03 '18 at 20:02
  • You should only commit batches, rather than individual ones. Commit after every 10K documents (make sure there's no `autoCommit` in your configuration as well). Also, is there anything else about your schema that relates to these multivalued fields that are slowing indexing down, e.g. they're all being copied into another? I found some issues about Hunspell making things very slow, though those were from Solr 3/4. – Jayce444 Jun 03 '18 at 23:37
  • I commented out all the other multivalued fields, and the result is the same. When I comment out the Hunspell filter, everything goes well. – beji dhia Jun 04 '18 at 05:59
  • Attach a profiler and see if Hunspell keeps some state between requests that makes its memory usage blow up (if it only gets slow after a while) - or try a different stemmer (.. or try to optimize the Hunspell-code if possible). – MatsLindh Jun 04 '18 at 07:04
  • @MatsLindh can you explain further how to attach a profiler, please? If it's a memory issue, I can increase the memory. – beji dhia Jun 04 '18 at 07:22
  • There are many java profilers available that can help you find out where the Java VM is spending its time - [jprofiler](https://www.ej-technologies.com/products/jprofiler/overview.html) is one of them. This assumes that you have experience with Java development. If not, indexing in separate batches (if it gets slow after a while) is probably an easier fix. – MatsLindh Jun 04 '18 at 07:32
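
The "try a different stemmer" suggestion from the comments could look roughly like the sketch below. It copies the analysis chain from the question and swaps `solr.HunspellStemFilterFactory` for `solr.FrenchLightStemFilterFactory`, a rule-based French stemmer that ships with Solr. The field type name `text_fr_light` is hypothetical, and using a single untyped `<analyzer>` for both indexing and querying is an assumption, since the question only shows the index-time analyzer. A light stemmer is not a dictionary-based lemmatizer, so tokens (and therefore search results) will likely differ from the Hunspell output; whether it actually removes the slowdown would have to be measured.

```xml
<!-- Hypothetical variant of text_fr_lemmatized: same chain, rule-based stemmer instead of Hunspell -->
<fieldType name="text_fr_light" class="solr.TextField" positionIncrementGap="100">
  <!-- Assumption: one analyzer shared by index and query time -->
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-select.txt" />
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-apostrophe.txt" />
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ponctuation.txt" />
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
    <!-- Replaces solr.HunspellStemFilterFactory; no dictionary or affix files needed -->
    <filter class="solr.FrenchLightStemFilterFactory" />
  </analyzer>
</fieldType>
```

Pointing one of the slow multivalued fields at this type and re-running the import on the same 50k-document chunk would show whether the slowdown follows the Hunspell filter specifically.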
