I have an index of 200M documents that I would like to reindex.
I wrote the following script, which iterates over the documents in the old index and puts them into the new index using bulk inserts.
The size of each bulk is 2000 documents.
search_obj = pyes.query.Search(query=pyes.query.MatchAllQuery(), start=resume_from)
old_index_iterator = self.esconn.search(search_obj, self.index_name)
counter = 0
BULK_SIZE = 2000
for doc in old_index_iterator:
    self.esconn.index(doc=doc, doc_type=DOC_TYPE, index=new_index_name, id=doc.get_id(), bulk=True)
    counter += 1
    if counter % BULK_SIZE == 0:
        self.logger.debug("Refreshing...")
        self.esconn.refresh()
        self.logger.debug("Refresh done.")
self.esconn.refresh()
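To be clear about what I expect bulk=True to do, here is a minimal sketch of client-side buffering that does not use pyes at all. BulkBuffer, add, flush, and the send callback are hypothetical names of my own, not part of the pyes API, and this is only my assumption about the intended behavior: documents are queued locally and one request goes out per 2000 documents.

```python
class BulkBuffer:
    """Hypothetical stand-in for client-side bulk buffering (not pyes API)."""

    def __init__(self, send, bulk_size=2000):
        self.send = send            # callable that ships one batch (one request)
        self.bulk_size = bulk_size
        self.pending = []

    def add(self, doc):
        # Queue the document; ship a batch only when the buffer is full.
        self.pending.append(doc)
        if len(self.pending) >= self.bulk_size:
            self.flush()

    def flush(self):
        # Ship whatever is queued as a single batch, then clear the buffer.
        if self.pending:
            self.send(self.pending)
            self.pending = []


requests = []                       # size of each batch that was "sent"
buf = BulkBuffer(send=lambda batch: requests.append(len(batch)))
for i in range(4500):
    buf.add({"id": i})
buf.flush()                         # ship the final partial batch
print(requests)                     # [2000, 2000, 500]
```

With this behavior, 4500 documents would produce 3 requests instead of 4500, which is the kind of reduction I was hoping to get from the script above.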
Observations:
- Throughput is very low: around 150 documents per minute.
- The refresh operation takes essentially no time. If I remove the index call and only read the documents, the script runs about 10 times faster.
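To put that rate in perspective, here is a quick back-of-the-envelope calculation. It assumes the rate of 150 documents per minute stays constant for the whole run, which I have not verified over a long run.

```python
TOTAL_DOCS = 200 * 10**6      # index size stated above
DOCS_PER_MINUTE = 150         # observed indexing rate

minutes = TOTAL_DOCS / DOCS_PER_MINUTE
days = minutes / (60 * 24)
print(f"{minutes:,.0f} minutes, about {days:,.0f} days")
# 1,333,333 minutes, about 926 days
```

So at the current rate the full reindex would take roughly two and a half years, which is why I need bulk indexing to actually work.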
Conclusion:
- The index call appears to ignore the bulk=True flag and sends every document to the ES server individually.
Can anyone help me figure out why bulk=True has no effect?