I am new to Elasticsearch and have a file with 180 fields and 12 million lines. I created an index and type in Elasticsearch and load the file with a Java program, but the load takes 1.5 hours. Is there a faster way to load the data into Elasticsearch? I also tried a MapReduce program, but it sometimes fails, generates duplicate entries, and takes longer than my sequential program.

Can anybody give good suggestions?

James Z
Jerin J
  • Crore and Lakh have been added to my English dictionary, thx :D, aside from that `10200000` is rather a large number, possibly you may need to have a cluster or something – nafas Jan 11 '16 at 14:09
  • Please don't use location-specific numbers like lakh and crore as most of us will need to look them up. – Wai Ha Lee Jan 11 '16 at 14:46
  • Are you using bulk upload, and have you tried different batch sizes? Have you tuned ES parameters such as flushes to disk? Are you seeing CPU, disk or network saturation? How much memory do you have in total, and how much for the ES heap? – NikoNyrh Jan 11 '16 at 18:30
  • How many documents are included in the file? (How many records do you expect to have in ES after it's complete?) – AssHat_ Feb 21 '16 at 10:19
  • You need to divide the input into individual records, then Spark can write the data to ES in parallel from each worker node. If this just defines a single record you need an advanced ES mapping for Spark to help. – AssHat_ Feb 21 '16 at 10:32

1 Answer

You can disable speculative execution when using the ES-Hadoop plugin to avoid duplicate entries. Try fine-tuning the batch size of the bulk API when indexing with MapReduce. For more information, see https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html and try changing the defaults to get the best performance. Also try increasing the ES heap size. You can also use Apache Tika or the mapper attachments plugin for ES to extract information from the file.
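
As a rough sketch of the settings involved, the snippet below collects them as plain key/value pairs so it runs without Hadoop on the classpath. In a real job you would copy each pair into the Hadoop `Configuration` with `conf.set(key, value)`. The class name `EsHadoopSettings`, the node address, the `myindex/mytype` resource, and the batch values are placeholders I made up, not recommendations. The speculative-execution keys are the Hadoop 2.x names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EsHadoopSettings {

    /**
     * Builds the job settings as key/value pairs.
     * All argument values are placeholders (assumptions); tune them for your cluster.
     */
    static Map<String, String> build(String nodes, String resource, int batchEntries) {
        Map<String, String> settings = new LinkedHashMap<>();

        // Speculative execution can run the same task twice, and both
        // attempts write to Elasticsearch, which produces duplicates.
        settings.put("mapreduce.map.speculative", "false");
        settings.put("mapreduce.reduce.speculative", "false");

        // Where to write: cluster address and target index/type.
        settings.put("es.nodes", nodes);
        settings.put("es.resource", resource);

        // Bulk sizing, applied per task. A bulk request is sent when
        // either limit is reached. "5mb" is an arbitrary starting point.
        settings.put("es.batch.size.entries", String.valueOf(batchEntries));
        settings.put("es.batch.size.bytes", "5mb");

        return settings;
    }

    public static void main(String[] args) {
        // Print the settings; in a job, call conf.set(k, v) for each pair instead.
        build("localhost:9200", "myindex/mytype", 5000)
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Since the batch limits apply to each task separately, the total load on the cluster grows with the number of parallel tasks, so it is probably worth changing one value at a time and measuring.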

Hope it helps!

Sachin