
Currently I have to update a field in over 1 million documents indexed in Elasticsearch. This is a complex task because the field contains metadata generated from XML files by evaluating XPath expressions. We have to loop over all the documents in the index and update this field, so, to avoid overloading the system, we decided to use the IronWorker platform.
I have read several posts about how to update millions of docs in Elasticsearch, like this one, but since we are going to use IronWorker there are some restrictions, such as a task only being allowed to run for 60 minutes.

Question: How do I loop over all the documents and update their fields, given the 60-minute restriction?
I thought of opening a scroll and passing the scroll_id to the next worker, but I have no idea how long it will take for the next task to start executing, so the scroll could expire and I would have to start all over.
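A sketch of what I have in mind, in Python with only the standard library. The cluster address, index name, batch size, and `extract_metadata` (our XPath evaluation, not shown) are placeholders for our actual setup; each worker would resume the scroll it was handed, update documents in bulk, and return the `scroll_id` before the 60-minute limit is reached:

```python
import json
import time
import urllib.request

ES = "http://localhost:9200"   # placeholder cluster address
INDEX = "docs"                 # hypothetical index name
SCROLL_TTL = "30m"             # keep-alive long enough to survive the hand-off
TIME_BUDGET = 50 * 60          # seconds; safety margin under the 60-min limit


def es_request(path, data, content_type="application/json"):
    """POST raw bytes to the cluster and decode the JSON response."""
    req = urllib.request.Request(ES + path, data=data,
                                 headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def bulk_update_body(index, updates):
    """Build an ND-JSON _bulk body of partial updates (pure, testable).

    `updates` maps document id -> new value for the metadata field.
    """
    lines = []
    for doc_id, value in updates.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": {"metadata": value}}))
    return "\n".join(lines) + "\n"


def extract_metadata(hit):
    """Placeholder for our XPath-based metadata extraction."""
    raise NotImplementedError


def run(scroll_id=None):
    """One worker's slice: resume (or open) the scroll, update documents
    until the time budget runs out, then hand off the scroll_id."""
    started = time.time()
    if scroll_id is None:
        body = json.dumps({"query": {"match_all": {}}, "size": 500}).encode()
        page = es_request("/%s/_search?scroll=%s" % (INDEX, SCROLL_TTL), body)
    else:
        body = json.dumps({"scroll": SCROLL_TTL, "scroll_id": scroll_id}).encode()
        page = es_request("/_search/scroll", body)

    while page["hits"]["hits"]:
        updates = {h["_id"]: extract_metadata(h) for h in page["hits"]["hits"]}
        es_request("/_bulk", bulk_update_body(INDEX, updates).encode(),
                   content_type="application/x-ndjson")
        if time.time() - started > TIME_BUDGET:
            return page["_scroll_id"]   # pass this to the next worker's payload
        body = json.dumps({"scroll": SCROLL_TTL,
                           "scroll_id": page["_scroll_id"]}).encode()
        page = es_request("/_search/scroll", body)
    return None   # finished the whole index
```

The open question is the hand-off in the middle: whether the next worker starts before `SCROLL_TTL` expires.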

Yasel
  • 1 million documents can be updated in a very short time, but it depends on a lot of things. The 60-minute restriction is only imposed because you've decided to go with IronWorker, but I'm sure there are other alternatives that would not overload your system. Unfortunately, we don't know enough about your requirements. What does your "complex task" of retrieving XML metadata involve? Can you show a sample of that XML metadata? A sample document? – Val May 15 '15 at 03:41
  • @Val, that's the thing: this task can be as complex as the client decides. The metadata analysis starts from an attachment uploaded by the user and a list of XPath expressions defined by them, so we need to be prepared for any degree of complexity. It would be a good start if I found a way to chain one IronWorker task to another and make the second one start after a known period of time, so that I could keep the scroll open for the next worker. – Yasel May 15 '15 at 14:24

1 Answer


It sounds from your description that you could chain IronWorker tasks together, which is actually very easy. If you have some idea of how long it takes to update a single item, you can extrapolate how long you need. Say it takes 100 ms to update an item; then you can do 10 per second, or 600 per minute, so maybe do 6,000 per task (which should take about 10 minutes), then queue up the next one from your code. Queuing up the next task is just as easy as queuing up the first one: http://dev.iron.io/worker/reference/api/#queue_a_task (you can use the client library for your language, too).
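To make the extrapolation concrete, here's a rough Python sketch: the batch-size arithmetic plus queuing the follow-up task against the endpoint linked above. The project ID, token, and worker code name (`update-docs`) are placeholders, and you should verify the exact request fields against the current API reference:

```python
import json
import urllib.request


def batch_size(per_item_ms, budget_seconds):
    """How many items fit in the time budget at the observed per-item
    latency, e.g. 100 ms per item over a 10-minute budget -> 6000 items."""
    return budget_seconds * 1000 // per_item_ms


def queue_next_task(project_id, token, scroll_id):
    """Queue the next worker via the IronWorker REST API (sketch).

    Endpoint shape per the queue_a_task docs linked above; "update-docs"
    is a hypothetical worker code name.
    """
    url = "https://worker-aws-us-east-1.iron.io/2/projects/%s/tasks" % project_id
    body = json.dumps({"tasks": [{
        "code_name": "update-docs",
        # carry the scroll_id forward so the next task resumes where we stopped
        "payload": json.dumps({"scroll_id": scroll_id}),
    }]}).encode()
    req = urllib.request.Request(url, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": "OAuth %s" % token,
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With 100 ms per item, `batch_size(100, 600)` gives the 6,000-item, roughly 10-minute batch from the paragraph above.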

Or just stop after X minutes and queue up the next worker.

Or, if you want to make things faster, how about queuing up 26 tasks at the same time, one for each letter of the alphabet? Each one can query for all the items starting with the letter it's assigned (using a Prefix Query).
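A quick sketch of that partitioning in Python; the field name is a placeholder, and each query body would go into the payload of one of the 26 workers:

```python
import string


def prefix_queries(field):
    """One Elasticsearch Prefix Query per lowercase letter, so 26 parallel
    workers can each process their own slice of the index."""
    return [{"query": {"prefix": {field: letter}}}
            for letter in string.ascii_lowercase]
```

Each worker then only scrolls over the documents matching its own prefix, so no scroll hand-off is needed at all.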

There are many ways to slice this problem.

Travis Reeder
  • Thanks Travis, there is still one issue: I need to keep the scroll open long enough for the next task. If I schedule the next task with no delay, can the IronWorker platform guarantee this, or does it depend on how many other tasks are scheduled? – Yasel May 19 '15 at 14:15
  • 1
    The next job should start within seconds. – Travis Reeder May 21 '15 at 05:23