
I have a folder with around 590,035 JSON files. Each file is a document that has to be indexed. If I index each document individually using Python, it takes more than 30 hours. How do I index these documents quickly?

Note - I've seen the bulk API, but that requires merging all the files into one, which takes a similar amount of time as above. Please tell me how to improve the speed. Thank you.

  • "which takes similar amount of time as above" => how do you know that? Have you actually tried it? – Val Jan 17 '19 at 12:16
  • use multiple threads – Stack Jan 17 '19 at 12:17
  • @Val Yeah, I used file operations in Python to merge the files and the average time was similar –  Jan 17 '19 at 12:21
  • @Stack Won't there be issues with `elasticsearch` if two threads try to index two documents at the same time? Will synchronising the `index` function solve this? –  Jan 17 '19 at 12:22
  • But did you use the bulk API, i.e. did you add the command line between each doc? – Val Jan 17 '19 at 12:25
  • Have you tried this? https://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers – Val Jan 17 '19 at 12:25

1 Answer


If you're sure that I/O is your bottleneck, use threads to read the files, e.g. with ThreadPoolExecutor, and either accumulate documents for bulk requests or index them one by one. ES will have no issues whatsoever with concurrent indexing, as long as you use unique IDs or let ES generate its internal IDs.

Bulk will be faster, simply because it saves you the HTTP overhead per document; indexing documents one by one is a little bit easier to code.
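
For illustration, here is a minimal sketch of that approach, assuming each file contains a single JSON document and the target index is called `docs`; the folder path, index name, worker count and batch size of 1,000 are placeholders rather than values from the question:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

DATA_DIR = Path("/path/to/json/files")  # placeholder path
INDEX = "docs"                          # placeholder index name

es = Elasticsearch()

def load(path):
    # Worker threads spend most of their time blocked on disk I/O here,
    # which is why a thread pool helps despite the GIL.
    with open(path) as f:
        return {"_index": INDEX, "_source": json.load(f)}
        # older ES versions may also need a "_type" field in the action

files = sorted(DATA_DIR.glob("*.json"))

with ThreadPoolExecutor(max_workers=10) as pool:
    # Read ~1k files at a time with the thread pool, then send each batch
    # as a single bulk request; nothing has to be merged on disk.
    for start in range(0, len(files), 1000):
        actions = list(pool.map(load, files[start:start + 1000]))
        helpers.bulk(es, actions)
```

Each `helpers.bulk` call is one HTTP round trip for a whole batch, which is where most of the speed-up over calling `es.index()` per document comes from.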

Slam
  • So you're suggesting that I use `ThreadPoolExecutor` to merge all the files into one and then use the `bulk` API to index the documents? –  Jan 17 '19 at 12:34
  • Merging 590K files together is probably not going to work, way too much data in my opinion, you'll have to do it in chunks, see the link I shared above – Val Jan 17 '19 at 12:42
  • Val is right; you should chunk it by, let's say, 1k files for each request. Threading is only to speed up reading the files – Slam Jan 17 '19 at 12:55
  • @Val I did see the above link. Thanks for that. How do I make the chunks so that each request has 1k files? –  Jan 17 '19 at 13:05
  • You have several examples across SO, but here is one: https://stackoverflow.com/a/54214168/4604579 – Val Jan 17 '19 at 13:10
  • Hi @Slam, I used `ThreadPoolExecutor` with 10 workers and it did reduce the time from 40 hours to 10 hours but it's still a lot of time. Even after increasing the number of workers the time is around 10 hours. Any way I can improve this? –  Jan 21 '19 at 10:57
  • It needs to be profiled; I don't know what your code is or where its bottlenecks are. If increasing the number of threads isn't helping, it's most probably because you're near the I/O limit for file reads. The main improvement is batching your requests to ES. – Slam Jan 21 '19 at 14:37
  • Code is similar to the documentation's example. The two major parts of the code are opening and closing each of the `json` files and using `es.index()` to index the content of each file. Is there a way to divide all the files into batches, use `ThreadPoolExecutor` for each batch and also run all the batches simultaneously? –  Jan 21 '19 at 15:19
  • One way is to write the data to a queue, and then read the queue in another thread and make bulk requests from there. – Slam Jan 21 '19 at 15:48
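
For completeness, a rough sketch of that queue pattern, reusing the same placeholder folder and index names as the example in the answer; the reader count, queue size and batch size are arbitrary starting points, not figures from this thread:

```python
import json
import queue
import threading
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

DATA_DIR = Path("/path/to/json/files")  # placeholder path
INDEX = "docs"                          # placeholder index name
BATCH_SIZE = 1000                       # arbitrary starting point
N_READERS = 4                           # tune to your disk

es = Elasticsearch()
q = queue.Queue(maxsize=5000)  # bounded, so readers can't run far ahead of the indexer
DONE = object()                # sentinel marking the end of one reader's stream

def reader(paths):
    # Producer: load each file and push an index action onto the queue.
    for path in paths:
        with open(path) as f:
            q.put({"_index": INDEX, "_source": json.load(f)})
    q.put(DONE)

def indexer():
    # Consumer: drain the queue and flush a bulk request every BATCH_SIZE docs.
    buf, finished = [], 0
    while finished < N_READERS:
        item = q.get()
        if item is DONE:
            finished += 1
            continue
        buf.append(item)
        if len(buf) >= BATCH_SIZE:
            helpers.bulk(es, buf)
            buf = []
    if buf:
        helpers.bulk(es, buf)

files = sorted(DATA_DIR.glob("*.json"))
threads = [threading.Thread(target=reader, args=(files[i::N_READERS],))
           for i in range(N_READERS)]
threads.append(threading.Thread(target=indexer))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With this layout the file reads and the bulk requests overlap, and the bounded queue keeps memory use flat even for ~590k documents.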