
I want to create a web service hosted in Windows Azure. Clients will upload files for processing, the cloud will process those files and produce result files, and the clients will then download the results.

I guess I'll use web roles for handling HTTP requests and worker roles for the actual processing, and something like Azure Queue or Azure Table Storage for tracking requests. Let's pretend it'll be Azure Table Storage - one "request" record per user-uploaded file.

A major design problem is that processing a single file can take anywhere from one second to, say, ten hours.

So I expect the following case: a worker role starts, goes to Azure Table Storage, finds a request marked "ready for processing", marks it "is being processed", and starts the actual processing. Normally it would process the file and mark the request "processed", but what if it dies unexpectedly?

Unless I take care of it, the request will remain in the "is being processed" state forever.
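In pseudocode-like Python (the helper functions here are placeholders for my table access code, not a real API), the naive worker loop looks like this:

    # Hypothetical sketch of the naive worker loop; find_request(), mark() and
    # process_file() are placeholders, not real Azure APIs.
    def worker_loop():
        while True:
            request = find_request(status="ready for processing")  # query Table Storage
            if request is None:
                continue
            mark(request, "is being processed")
            process_file(request)            # anywhere from 1 second to ~10 hours
            mark(request, "processed")
            # If the role instance dies between the two mark() calls,
            # the request is stuck in "is being processed" forever.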

How do I track requests that are marked "is being processed" but abandoned? What mechanism in Windows Azure would be most convenient for that?

sharptooth

4 Answers


The main issue you have is that queues cannot set a visibility timeout larger than 2 hours today. So, you need another mechanism to indicate that active work is in progress. I would suggest a blob lease. For every file you process, you either lease the blob itself or a 0-byte marker blob. Your workers scan the available blobs and attempt to lease them. If they get the lease, it means the file is not being processed and they go ahead and process it. If they fail to acquire the lease, another worker must be actively working on it.

Once the worker has completed processing the file, it simply copies the file into another container in blob storage (or deletes it if you wish) so that it is not scanned again.

Leases are really your only answer here until queue messages can be renewed.

edit: I should clarify that the reason leases work here is that a lease must be actively renewed every 30 seconds or so, so you have a very small window in which you know whether a worker has died or is still working on the file.
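A minimal sketch of the pattern, assuming the modern azure-storage-blob Python SDK (the original scenario used the 2011-era .NET storage client, so treat the names here as illustrative):

    # Lease pattern sketch using the azure-storage-blob Python SDK.
    import threading

    from azure.core.exceptions import HttpResponseError
    from azure.storage.blob import BlobClient

    def try_process(conn_str, container, blob_name, process):
        blob = BlobClient.from_connection_string(conn_str, container, blob_name)
        try:
            lease = blob.acquire_lease(lease_duration=60)   # 15-60 s, or -1 for infinite
        except HttpResponseError:
            return False      # someone else holds the lease -> file is being processed

        stop = threading.Event()

        def keep_alive():
            while not stop.wait(30):   # renew well before the 60 s lease expires
                lease.renew()

        threading.Thread(target=keep_alive, daemon=True).start()
        try:
            process(blob)             # the long-running work (seconds to hours)
        finally:
            stop.set()
            lease.release()           # if the worker dies instead, the lease simply lapses
        return True

If the worker crashes, it stops renewing, the lease expires within a minute, and the next worker that scans the blob can acquire it.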

dunnry
  • Forgot about the 2-hour queue message limit (I rarely have queue messages living that long). However, Service Bus messages have a far greater timeout (just released a few days ago). – David Makogon May 20 '11 at 13:52
  • Will calling "renew lease" be billed against my account? – sharptooth May 20 '11 at 14:33
  • Yes. Every REST call you make is billed as a transaction. The lease call is a PUT, so 1 transaction. If you renewed the lease every 30 seconds, it would take almost a year per lease (about 347 days) before that cost you $1, however. – dunnry May 20 '11 at 14:39

I believe this problem is not technology-specific.
Since your processing jobs are long-running, I suggest these jobs should report their progress during execution. That way, a job which has not reported progress for a substantial duration becomes a clear candidate for cleanup and can then be restarted on another worker role.
How you record progress and do job swapping is up to you. One approach is to use a database as the recording mechanism and create an agent worker process that polls the job progress table. If the agent detects a problem, it can take corrective action.

Another approach would be to associate the worker role identification with the long-running process. The worker roles can communicate their health status using some sort of heartbeat.
Had the jobs not been long-running, you could have recorded the job's start time instead of a status flag and used a timeout to determine whether processing had failed.
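A minimal sketch of the record-progress-and-reap idea (sqlite3 is just a stand-in for whatever recording store you pick, and the jobs schema is assumed):

    # Illustrative heartbeat/monitor pattern; sqlite3 stands in for the real store.
    import sqlite3
    import time

    HEARTBEAT_TIMEOUT = 120   # seconds without a heartbeat before a job is presumed dead

    def record_heartbeat(db, job_id):
        # The worker calls this periodically while it is processing a file.
        db.execute("UPDATE jobs SET last_heartbeat = ? WHERE id = ?", (time.time(), job_id))
        db.commit()

    def reap_abandoned(db):
        # The agent process runs this on a schedule and re-queues stale jobs.
        cutoff = time.time() - HEARTBEAT_TIMEOUT
        db.execute(
            "UPDATE jobs SET status = 'ready' "
            "WHERE status = 'processing' AND last_heartbeat < ?",
            (cutoff,),
        )
        db.commit()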

Chandermani

The problem you describe is best handled with Azure Queues, as Azure Table Storage won't give you any type of management mechanism.

Using Azure Queues, you set a timeout when you get an item off the queue (default: 30 seconds). Once you read a queue item (e.g. "process file x waiting for you in blob at url y"), that queue item becomes invisible for the time period specified. This means that other worker role instances won't try to grab it at the same time. Once you complete processing, you simply delete the queue item.

Now: Let's say you're almost done and haven't deleted the queue item yet. All of a sudden, your role instance unexpectedly crashes (or the hardware fails, or you're rebooted for some reason). The queue-item processing code has now stopped. Eventually, once the timeout you set has elapsed since the queue item was originally read, the item becomes visible again. One of your worker role instances will once again read the queue item and can process it.

A few things to keep in mind:

  • Queue items have a dequeue count. Pay attention to this. Once you hit a certain number of dequeues for a specific queue item (I like to use 3 as my limit), you should move the queue item to a 'poison queue' or table storage for offline evaluation - there could be something wrong with the message or with the process around handling that message (see the sketch after this list).
  • Make sure your processing is idempotent (i.e. you can process the same message multiple times with no side-effects).
  • Because a queue item can go invisible and then return to visibility later, queue items don't necessarily get processed in FIFO order.
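Putting this together, a minimal worker sketch using the azure-storage-queue Python SDK (an assumption - this SDK postdates the answer, and the queue names are made up):

    # Queue-driven worker sketch; visibility_timeout hides the message from
    # other workers while it is being processed.
    from azure.storage.queue import QueueClient

    MAX_DEQUEUES = 3   # after this many attempts, park the message for inspection

    def drain(conn_str, process):
        work = QueueClient.from_connection_string(conn_str, "workitems")
        poison = QueueClient.from_connection_string(conn_str, "workitems-poison")

        for msg in work.receive_messages(visibility_timeout=30 * 60):
            if msg.dequeue_count > MAX_DEQUEUES:
                poison.send_message(msg.content)   # set aside for offline evaluation
                work.delete_message(msg)
                continue
            process(msg.content)                   # must be idempotent
            work.delete_message(msg)               # delete only after successful processing
            # If the worker crashes before delete_message(), the item reappears
            # on the queue once the visibility timeout expires.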

EDIT: Per Ryan's answer - Azure queue messages max out at a 2-hour timeout. Service Bus queue messages have a far-greater timeout. This feature just went CTP a few days ago.

David Makogon
  • That won't fit my task, since I can't come up with a reasonable default value. Any task can take any time from one second to many hours to process and that's normal. – sharptooth May 20 '11 at 11:27
  • So, why not just set the timeout to 12 hours for all queue items? The worst-case scenario is that a failed task (due to a crash) won't be reprocessed for 1/2-day. As an alternative, can you predict a ballpark timeout value prior to placing it on the queue? If so, you can then set up 2 or more queues (e.g. fastq, mediumq, slowq) and spawn threads to read from each, using timeouts of, say, 30 seconds, 1 hour, 12 hours. – David Makogon May 20 '11 at 11:33
  • I can't predict it and it can take 14 hours just as well - there's no sane upper limit. Whatever I set the timeout to, some items will be locked for a rather long period. For example, a user uploads a file that needs 30 seconds to process and the role processing it crashes. The user will have to wait for the entire timeout before it is reprocessed, even if it's the only file in the system. – sharptooth May 20 '11 at 11:37
  • We're talking about an edge case. I've had role instances run for a month before getting recycled (specifically for OS updates). Just take this into consideration when designing your solution. You might only run into this for 2 or 3 items monthly out of how many? Only you can decide if that's worth extra engineering (such as turning this into a staged workflow). – David Makogon May 20 '11 at 11:55

Your role's OnStop() could be part of the solution, but there are some circumstances (hardware failure) where it won't get called. To cover that case, have your OnStart() mark everything with the same RoleInstanceID as abandoned: OnStart() only runs after the previous incarnation of that instance has stopped, so anything that instance still "owns" can no longer be in progress. (Luckily, you can observe that Azure reuses its role instance IDs, which is what makes this work.)
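As a rough illustration of that cleanup, here is a sketch using the azure-data-tables Python SDK (purely a stand-in - in a real role this would be .NET code in RoleEntryPoint.OnStart(), and the table and field names are assumptions):

    # Hypothetical OnStart()-style cleanup: release requests still owned by this
    # role instance ID from a previous incarnation.
    from azure.data.tables import TableClient

    def release_abandoned_requests(conn_str, instance_id):
        table = TableClient.from_connection_string(conn_str, table_name="requests")
        stale = table.query_entities(
            "Status eq 'is being processed' and OwnerInstanceId eq @iid",
            parameters={"iid": instance_id},
        )
        for entity in stale:
            # This instance is only just starting, so anything it still "owns"
            # must have been abandoned by its previous incarnation.
            entity["Status"] = "ready for processing"
            table.update_entity(entity)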

Oliver Bock