I loaded over 38 million documents (text strings) into an Elasticsearch index on my local machine. I would like to compute the length of each string and add that value as metadata in the index.
Should I have computed the string lengths and added them as metadata before loading the documents into Elasticsearch? Or can I update the metadata with a computed value after the fact?
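If updating after the fact is possible, I imagine it would look something like the sketch below, which uses the plain elasticsearch-py client (rather than haystack) and an update_by_query call with a Painless script to compute the length server-side. The 'text_length' field name is just a placeholder I made up; I haven't tried this:

    from elasticsearch import Elasticsearch

    es = Elasticsearch('http://localhost:9200')

    # Compute the length of the existing 'text' field on the server and
    # store it in a new field ('text_length' is a placeholder name).
    es.update_by_query(
        index='myindex',
        body={
            'query': {'match_all': {}},
            'script': {
                'lang': 'painless',
                'source': 'ctx._source.text_length = ctx._source.text.length()'
            }
        }
    )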
I'm relatively new to Elasticsearch/Kibana and these questions arose because of the following Python experiments:
Option 1: Data as a list of strings
    mylist = ['string_1', 'string_2', ..., 'string_N']
    L = [len(s) for s in mylist]  # this computation takes about 1 minute on my machine
The downside of option 1 is that I'm not leveraging Elasticsearch and 'mylist' is occupying a large chunk of memory.
Option 2: Data as an Elasticsearch index, where each string in 'mylist' was loaded into the field 'text'
    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

    document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='myindex')
    docs = document_store.get_all_documents_generator()
    L = [len(d.text) for d in docs]  # this computation takes about 6 minutes on my machine
The downside of option 2 is that the computation took much longer. The upside is that the generator keeps memory usage low. The long computation time is why I thought storing the string length (and other analytics) as metadata in Elasticsearch would be a good solution.
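If the better answer is to compute the lengths up front, I assume the loading step with haystack would have looked roughly like the sketch below, attaching the length as a meta field at write time (the 'length' meta key is a placeholder name, and I'm not sure this is the idiomatic way):

    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

    document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='myindex')

    # Pre-compute each string's length and attach it as metadata at write time
    # ('length' is a placeholder meta key).
    mylist = ['string_1', 'string_2']  # stand-in for the full list of 38M strings
    docs_with_meta = [{'text': s, 'meta': {'length': len(s)}} for s in mylist]
    document_store.write_documents(docs_with_meta)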
Are there other options I should consider? What am I missing?