
As part of data analysis, I collect records I need to store in Elasticsearch. As of now I gather the records in an intermediate list, which I then write via a bulk update.

While this works, it has its limits when the number of records is so large that they do not fit into memory. I am therefore wondering if it is possible to use a "streaming" mechanism, which would allow me to

  • persistently open a connection to Elasticsearch
  • continuously update in a bulk-like way

I understand that I could simply open a connection to Elasticsearch and index the documents one at a time as they become available, but this is about 10 times slower, so I would like to keep the bulk mechanism:

import elasticsearch
import elasticsearch.helpers
import random
import string
import time

index = "testindexyop1"
es = elasticsearch.Elasticsearch(hosts='elk.example.com')

# start from a clean index
if es.indices.exists(index=index):
    es.indices.delete(index=index)

# generate random 10-character test documents
data = list()
for i in range(1, 10000):
    data.append({'hello': ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))})

start = time.time()

# this version (one request per document) takes 25 seconds
# for doc in data:
#     res = es.index(index=index, doc_type="document", body=doc)

# and this one - 2 seconds
elasticsearch.helpers.bulk(client=es, index=index, actions=data, doc_type="document", raise_on_error=True)

print(time.time() - start)
WoJ

1 Answer


You can always simply split the data into n approximately equally sized chunks, such that each of them fits in memory, and then do n bulk updates, as sketched below. This seems to be the easiest solution to me.
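For example, here is a minimal sketch of that idea (the iter_records generator and the chunk_size value are placeholders, not taken from your post): records are read in fixed-size batches, so only one batch is in memory at a time, and each batch is written with a single bulk call.

import itertools

import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(hosts='elk.example.com')

def iter_records():
    # placeholder: yield documents one at a time instead of building a full list
    for i in range(100000):
        yield {'hello': str(i)}

chunk_size = 5000  # pick a size that comfortably fits in memory
records = iter_records()
while True:
    chunk = list(itertools.islice(records, chunk_size))
    if not chunk:
        break
    elasticsearch.helpers.bulk(client=es, index="testindexyop1", actions=chunk,
                               doc_type="document", raise_on_error=True)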

Martin Krämer
  • yes, this is a good solution, but it somewhat reimplements the streaming functionality by hand. I was looking for something built in. – WoJ Apr 02 '15 at 11:34
  • Have you looked at this? http://elasticsearch-py.readthedocs.org/en/latest/helpers.html#elasticsearch.helpers.streaming_bulk – Martin Krämer Apr 02 '15 at 11:39
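
For reference, a minimal sketch of how the streaming_bulk helper linked above could be used (the generate_records generator is a placeholder): it accepts any iterable of actions and sends them to Elasticsearch in chunks, so the full record set never has to be held in memory.

import elasticsearch
import elasticsearch.helpers

es = elasticsearch.Elasticsearch(hosts='elk.example.com')

def generate_records():
    # placeholder: yield documents as they are produced by the analysis
    for i in range(100000):
        yield {'hello': str(i)}

# streaming_bulk yields one (ok, result) tuple per document as chunks are flushed
for ok, result in elasticsearch.helpers.streaming_bulk(
        client=es, actions=generate_records(), chunk_size=500,
        index="testindexyop1", doc_type="document"):
    if not ok:
        print(result)

Because streaming_bulk is a generator, it only sends requests while it is being iterated, which matches the continuous, bulk-like updating described in the question.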