
I have an AWS Elasticsearch cluster and I have created an index on it. I want to upload 1 million documents into that index. I am using the Python package elasticsearch, version 6.0.0, to do so.

My payload structure is similar to this -

{  
   "a":1,
   "b":2,
   "a_info":{  
      "id":1,
      "name":"Test_a"
   },
   "b_info":{  
      "id":1,
      "name":"Test_b"
   }
}

After the discussion in the comments section, I realise that the total number of fields in a document also includes its subfields. So in my case, the total number of fields in each document comes to about 60.

I have tried the following methods -

  1. Using the Bulk() interface as described in the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.bulk). The error I received with this method was:
    • A timeout response after waiting for ~10-20 minutes.

With this method, I have also tried uploading the documents in batches of 100, but I am still getting timeouts (a sketch of this approach follows the list).

  2. I have also tried adding documents one by one as per the documentation (https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.Elasticsearch.create). This method takes a lot of time to upload even one document (a sketch of this follows the list as well). Also, I am getting this error for a few of the documents -
TransportError(500, u'timeout_exception', u'Failed to acknowledge mapping update within [30s]')
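
For reference, here is a minimal sketch of the kind of batched bulk upload I mean, written with the helpers.bulk wrapper rather than the raw bulk() call; the endpoint, the my_docs iterable, the index/type names, and the chunk size / timeout values are placeholders rather than my exact code:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Placeholder endpoint; adjust for your AWS ES domain and auth setup.
es = Elasticsearch(["https://my-aws-es-endpoint:443"], use_ssl=True, verify_certs=True)

def generate_actions(docs):
    # Wrap each payload dict in a bulk "index" action.
    for doc in docs:
        yield {
            "_index": "Test",
            "_type": "_doc",
            "_source": doc,
        }

# my_docs is the iterable of 1 million payload dicts (placeholder name).
success, errors = bulk(
    es,
    generate_actions(my_docs),
    chunk_size=500,           # documents per bulk request
    request_timeout=120,      # seconds; raise if requests still time out
    raise_on_error=False,     # collect per-document errors instead of raising
)
print(success, errors)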

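And a minimal sketch of the one-by-one approach with create(), again with placeholder names and sequential ids; each call is a separate HTTP round trip, which is why it is so slow:

# Same client and my_docs as in the previous sketch.
for i, doc in enumerate(my_docs):
    es.create(
        index="Test",
        doc_type="_doc",
        id=str(i),              # placeholder id scheme
        body=doc,
        request_timeout=60,     # seconds per request
    )
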
My index settings are these -

{"Test":{"settings":{"index":{"mapping":{"total_fields":{"limit":"200000000"}},"number_of_shards":"5","provided_name":"Test","creation_date":"1557835068058","number_of_replicas":"1","uuid":"LiaKPAAoRFO6zWu5pc7WDQ","version":{"created":"6050499"}}}}}

I am new to the Elasticsearch domain. How can I upload my documents to the AWS ES cluster quickly?

  • Are you sure that your documents have 200,000,000 fields? – Val May 17 '19 at 05:36
  • Earlier I was getting this error while uploading documents one by one - "Limit of total fields [1000] in index has been exceeded". To resolve that error I raised the field limit to 200,000,000. I am sure the number of fields after my whole dataset is uploaded will reach around 100,000,000, but the number of documents might increase in the future, so to be on the safe side I set the limit to 200,000,000. – Sahil Chaudhary May 17 '19 at 05:42
  • I sincerely doubt that you will have documents with 200,000,000 fields in them. You might have 200,000,000 documents for sure, but way less fields than that in each document, or else you have a data design issue. – Val May 17 '19 at 05:48
  • Thanks @Val for pointing that out. Yes, you are right that my document count will reach up to 200,000,000, but the number of fields in each document will be ~15. What do you think would be a suitable field limit in this case? Also, is this the reason why I am getting the timeout error? – Sahil Chaudhary May 17 '19 at 06:02
  • with 15 fields per doc, you can leave the default field limit, which is 1000, no need to override it – Val May 17 '19 at 06:28
  • @Val By doing that I am getting this error again - "RequestError(400, u'illegal_argument_exception', u'Limit of total fields [1000] in index [index_name] has been exceeded')" while adding a single document. – Sahil Chaudhary May 17 '19 at 07:55
  • then it means you have more than 1000 fields in some of your documents. Can you show your mapping? What do you get when running `GET Test`? – Val May 17 '19 at 07:58
  • I have checked the field count using AWS_ES_HOST/index_name/_mapping?pretty. The number of "type" keywords goes up to 200000 in the response, following the steps explained in https://stackoverflow.com/questions/48490006/indexing-twitter-data-into-elasticsearch-limit-of-total-fields-1000-in-index . The reason for this variable number of fields is that I am storing a dictionary with a variable number of keys for each document. – Sahil Chaudhary May 17 '19 at 08:04
  • Can you show what you get? – Val May 17 '19 at 08:05
  • The response is very large... so I can't show it directly. Here is a snippet - "ef" : { "properties" : { "1" : { "properties" : { "3306" : { "properties" : { "field_id" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, – Sahil Chaudhary May 17 '19 at 08:10
  • Ok, I see that some fields are actually numbers ("1", "3306", etc), so you seem to have fields that are hashes with arbitrary keys, which is probably why you cross the 1000 limit. Just know that the total number of fields includes all sub-fields and nested fields as well. That's not necessarily the best data design, but I don't know your exact use case either. – Val May 17 '19 at 08:12
  • Thanks @Val for the info. I have updated the number of fields in the question description as well. – Sahil Chaudhary May 17 '19 at 08:19

0 Answers