
I am using ElasticSearch to index some data, but I found that the performance is not very efficient.

There are only 3000 entries, and each entry has 6 columns. It takes 5 minutes to index these 3000 entries.

Because I am new to ElasticSearch, my code and program flow are basic, as follows:

  1. Search and check whether any identical data already exists.
  2. If the same data exists, update it.
  3. If not, add it.

The code is as follows:

import pyes
conn = pyes.ES('server:9200')

Search:

searchResult = conn.search(searchDict, indexName, TypeName)

Index:

conn.index(storeDict, indexName, TypeName, id)

Update the Count in the indexed data:

 conn.partial_update(indexName, TypeName, id, "ctx._source.Count += counter", params={"counter" : 1})
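
Putting those pieces together, the whole flow looks roughly like this (a minimal sketch; Name, Count, and the index/type names are just placeholders for my real fields):

    import pyes

    conn = pyes.ES('server:9200')
    indexName, TypeName = "myindex", "mytype"   # placeholders

    def store_entry(entry, id):
        # 1. Search for an existing document (hypothetical "Name" field as the key).
        searchDict = {"query": {"term": {"Name": entry["Name"]}}}
        searchResult = conn.search(searchDict, indexName, TypeName)

        if searchResult.total > 0:   # total hits (attribute name may differ across pyes versions)
            # 2. Same data already indexed: bump its counter with a scripted partial update.
            conn.partial_update(indexName, TypeName, id,
                                "ctx._source.Count += counter",
                                params={"counter": 1})
        else:
            # 3. Not found: index it as a new document.
            conn.index(entry, indexName, TypeName, id)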

Is there any way to improve the performance of my code?

Thank you for your help.

Jimmy Lin
  • Could you make the title of the question a little more descriptive? It seems more about improving the way you use Elasticsearch in your application than about improving its own performance. – javanna Jul 26 '13 at 08:50

2 Answers


You don't need to search before updating. Read the ES docs on updating and scroll down to the upsert section. `upsert` is a parameter that holds a document to use if the document does not exist on the server; otherwise, the upsert is ignored and the request works like a normal update (as you are doing now).
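
For example, something like this should work with pyes (a rough sketch; I'm assuming your version of pyes exposes an upsert argument on partial_update, and the Count field is just taken from your question):

    # One round trip: if the document is missing, the "upsert" body is indexed as-is;
    # if it already exists, the script runs against it instead.
    conn.partial_update(indexName, TypeName, id,
                        "ctx._source.Count += counter",
                        params={"counter": 1},
                        upsert={"Count": 1})   # put your full document (all 6 columns) here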

Good luck!

ramseykhalaf
  • Hi, it's me again. I revised my code with upsert and it's better: it now takes 3 minutes to finish. Is there any way to get the time under 1 minute? – Jimmy Lin Jul 25 '13 at 06:49
  • I changed the config file in /bin/elasticsearch.in.sh, but it doesn't seem to work even after I restart Elasticsearch. How can I get Elasticsearch to read the new settings file? – Jimmy Lin Jul 25 '13 at 07:15
  • I'm not so sure about the settings files, sorry. If you want to get the index time even lower, don't use an update script. What I would experiment with (if you are incrementing one field in the update) is to calculate the result of the modification, create the new document in pyes, and then just overwrite the old document. (Use the normal put API, as you are doing in step 3 of your question; see the sketch after these comments.) – ramseykhalaf Jul 25 '13 at 07:24
  • Also I forgot to mention, you should look at the `_bulk` [api from the es docs](http://www.elasticsearch.org/guide/reference/api/bulk/) – ramseykhalaf Jul 25 '13 at 07:40
  • It's unbelievable: I used a bulk size of 400, and it took only 2 seconds to finish this job. – Jimmy Lin Jul 25 '13 at 10:12
  • @Jimmy you used bulk size 400? You mean that when you created the connection, you did conn = ES("server:9200", bulk_size=400)? I did that but the performance is the same. – B.Mr.W. Oct 14 '13 at 21:58
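
A rough sketch of the overwrite approach ramseykhalaf describes in the comments above (assuming the running count can be kept on the client side; field and variable names are placeholders):

    counts = {}   # running counters kept client-side

    def store_entry(entry, doc_id):
        # Work out the new count locally instead of running an update script on the server.
        counts[doc_id] = counts.get(doc_id, 0) + 1
        doc = dict(entry, Count=counts[doc_id])
        # Overwrite the old document with a plain index (put) request: no search, no script.
        conn.index(doc, indexName, TypeName, doc_id)
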
  • You can use the versioning feature of Elasticsearch. If you are assigning your document IDs yourself, it's pretty easy: indexing the same ID again simply re-indexes (overwrites) the data, so you can skip the search-then-update step.

  • You should use the bulk API for indexing (a batch size of 1000-5000 is good); see the sketch below.

  • Another cause of bad indexing performance is the configuration in config/elasticsearch.yml; you can use these hints to increase indexing performance.
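
For example, with pyes you can queue index requests and let the client send them in batches (a rough sketch; bulk_size and the flush call may differ slightly between pyes versions):

    import pyes

    conn = pyes.ES('server:9200', bulk_size=1000)   # batch size in the 1000-5000 range
    indexName, TypeName = "myindex", "mytype"       # placeholders

    for doc_id, doc in enumerate(entries):          # entries: your 3000 rows
        # bulk=True only queues the request; pyes sends one _bulk request
        # automatically every bulk_size operations.
        conn.index(doc, indexName, TypeName, doc_id, bulk=True)

    # Send whatever is still sitting in the queue at the end.
    conn.flush_bulk(forced=True)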

shyos