I have an index of 200M documents that I would like to reindex.

I wrote the following script, which iterates over the documents in the old index and puts them into the new index using bulk inserts.

Each bulk batch contains 2000 documents.

 search_obj = pyes.query.Search(query = pyes.query.MatchAllQuery(), start=resume_from)

 old_index_iterator = self.esconn.search(search_obj, self.index_name)
 counter = 0
 BULK_SIZE = 2000

 for doc in old_index_iterator:
   self.esconn.index(doc=doc, doc_type=DOC_TYPE, index=new_index_name, id=doc.get_id(), bulk=True)
   counter += 1

   if counter % BULK_SIZE == 0:
     self.logger.debug("Refreshing...")
     self.esconn.refresh()
     self.logger.debug("Refresh done.")


 self.esconn.refresh()

Observations:

  1. The throughput is very low: around 150 documents per minute.
  2. The refresh operation takes effectively no time. If I remove the index call and only read the documents, the loop runs about 10 times faster.

Conclusion:

  • The index call seems to ignore the bulk=True flag and pushes every single document to the ES server individually.

Can anyone help me figure out why bulk=True has no effect?

diemacht
  • This question may help: http://stackoverflow.com/questions/9002982/elasticsearch-bulk-index-in-chunks-using-pyes – Gentle Y Mar 17 '16 at 00:34

1 Answer

Your low speed comes from reading from the old index, not from inserting into the new one.
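One way to check which side is the bottleneck is to time the read and write steps separately. The following is only a minimal sketch in plain Python: `timed_iter` and `profile_copy` are hypothetical helper names, `source` stands for whatever iterator yields your documents, and `write` stands for whatever call indexes one document. None of these are part of pyes.

```python
import time


def timed_iter(iterable, stats):
    """Yield items from `iterable`, adding the time spent waiting for
    each item to stats["read"]."""
    it = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            stats["read"] += time.perf_counter() - start
            return
        stats["read"] += time.perf_counter() - start
        yield item


def profile_copy(source, write):
    """Copy every item from `source` through `write` and report how many
    seconds were spent reading versus writing."""
    stats = {"read": 0.0, "write": 0.0, "docs": 0}
    for doc in timed_iter(source, stats):
        start = time.perf_counter()
        write(doc)  # hypothetical: your per-document index call
        stats["write"] += time.perf_counter() - start
        stats["docs"] += 1
    return stats
```

If `stats["read"]` dominates on a sample of a few thousand documents, the read path is the part to fix.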

Try using scan mode with scrolling when you read:

result_set = self.esconn.search(pyes.query.MatchAllQuery(), indices=INDEX_NAME, doc_types=INDEX_TYPE, scan=True, scroll_timeout="10m")
for doc in result_set:
    pass  # do your insert task
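If you prefer to control the batching yourself instead of relying on the client's bulk buffer, the insert task can be sketched as below. This is a generic illustration, not pyes code: `chunked` and `copy_in_batches` are hypothetical helpers, and `send_batch` is an assumed callback that would submit one list of documents to the new index in a single bulk request.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield consecutive lists of at most `size` items from `iterable`."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def copy_in_batches(source, send_batch, batch_size=2000):
    """Pass documents from `source` to `send_batch` one batch at a time
    and return the total number of documents handed over."""
    total = 0
    for batch in chunked(source, batch_size):
        send_batch(batch)  # hypothetical: one bulk request per batch
        total += len(batch)
    return total
```

Here `source` would be the scan/scroll result set from above, so only one batch of documents is held in memory at a time.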

Also, the default bulk size is 400 and the default refresh interval is 1s, so you usually do not need to change these settings.

Gentle Y