As part of data analysis, I collect records I need to store in Elasticsearch. As of now I gather the records in an intermediate list, which I then write via a bulk update.
While this works, it has its limits once the number of records is so large that they no longer fit into memory. I am therefore wondering whether it is possible to use a "streaming" mechanism, which would allow me to:
- keep a connection to Elasticsearch persistently open
- continuously push updates in a bulk-like way (roughly as in the sketch below)
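Something along these lines is what I have in mind. This is only a rough sketch, assuming that helpers.streaming_bulk from the same elasticsearch library can lazily consume a generator of actions; generate_docs is just a placeholder for my real record source, and I am on the same client version as the code below, where doc_type/_type is still accepted:

import random
import string

import elasticsearch
import elasticsearch.helpers

index = "testindexyop1"
es = elasticsearch.Elasticsearch(hosts='elk.example.com')

def generate_docs(n=100000):
    # placeholder for my real data source: records are yielded one at a
    # time instead of being accumulated in an intermediate list first
    for _ in range(n):
        yield {
            "_index": index,
            "_type": "document",  # same doc_type as in my bulk code below
            "_source": {
                "hello": "".join(
                    random.choice(string.ascii_uppercase + string.digits)
                    for _ in range(10)
                )
            },
        }

# streaming_bulk consumes the generator lazily and sends the documents in
# chunks over the same connection, so the full dataset is never in memory
for ok, result in elasticsearch.helpers.streaming_bulk(
        client=es, actions=generate_docs(), chunk_size=500):
    if not ok:
        print(result)

The key point for me is the generator: it would let me keep the speed of the bulk mechanism without first building the full list.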
I understand that I could simply open a connection to Elasticsearch and index the records one by one as they become available, but this is about 10 times slower (25 seconds vs. 2 seconds in the timing below), so I would like to keep the bulk mechanism:
import elasticsearch
import elasticsearch.helpers
import elasticsearch.client
import random
import string
import time

index = "testindexyop1"
es = elasticsearch.Elasticsearch(hosts='elk.example.com')

# start from a clean index
if elasticsearch.client.IndicesClient(es).exists(index=index):
    ret = elasticsearch.client.IndicesClient(es).delete(index=index)

# gather the records in an intermediate list (random strings as dummy data)
data = list()
for i in range(1, 10000):
    data.append({'hello': ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))})

start = time.time()

# this version takes about 25 seconds
# for _ in data:
#     res = es.bulk(index=index, doc_type="document", body=_)

# and this one about 2 seconds
elasticsearch.helpers.bulk(client=es, index=index, actions=data, doc_type="document", raise_on_error=True)

print(time.time() - start)