
I have a situation involving an MVC app, to which a potentially large number of data chunks, each up to 32 MB, are uploaded. After each chunk is uploaded, it needs to be processed and a response sent before the client browser uploads the next chunk.

Ultimately the data and the results of its processing need to be stored on Azure storage. The data processing is CPU intensive. Given that transferring this amount of data takes an appreciable amount of time, I am looking to reduce the number of trips the data needs to do between machines, as well as move the work out of the web server threads.

Currently this is done by queuing up the jobs which are consumed by a single worker thread.
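For reference, the current arrangement is roughly along these lines (a simplified sketch; `ChunkJob` and the commented-out calls stand in for the real types and processing code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Simplified sketch of the current setup: controllers enqueue jobs,
// a single background thread dequeues and processes them.
public class ChunkJob
{
    public Guid SessionId { get; set; }
    public byte[] Data { get; set; }   // up to ~32 MB per chunk
}

public class JobQueue
{
    private readonly BlockingCollection<ChunkJob> _jobs = new BlockingCollection<ChunkJob>();

    public JobQueue()
    {
        // Single consumer thread does the CPU-intensive work off the request threads.
        Task.Factory.StartNew(Consume, TaskCreationOptions.LongRunning);
    }

    public void Enqueue(ChunkJob job)
    {
        _jobs.Add(job);
    }

    private void Consume()
    {
        foreach (var job in _jobs.GetConsumingEnumerable())
        {
            // Process(job);            // placeholder: CPU-intensive processing
            // UploadResultToBlob(job); // placeholder: push the result to Blob storage
        }
    }
}
```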

However, this process needs to be upgraded so that it runs an executable to do the heavy work.

At the end of processing, the data is uploaded to Azure Blob storage. So, the data already needs to be transferred twice over the network (AFAIK) before the response is sent. Not ideal.

I am aware of the different queuing options in Azure, but am wary of making the situation worse rather than better. I don't want to over-engineer this, but I do need to make the entire process run as quickly and efficiently as possible.

a) What kind of data transfer speeds can I expect between an Azure Web Role and Worker Role in a Cloud Service?

b) Is there any way to transfer the data directly to Azure storage and then process it there, without transferring it again?

c) Can / Will the worker role and web role actually run on the same machine?

d) Can I just run the .exe from inside the web app? If so, how do I get the path to it?

Tom
  • If you don't know the answer, please just move on. – Tom Jul 08 '13 at 22:07
  • possible duplicate of [Windows Azure. Optimal architecture for processing and storing large data blocks quickly](http://stackoverflow.com/questions/17536694/windows-azure-optimal-architecture-for-processing-and-storing-large-data-blocks) – David Makogon Jul 09 '13 at 03:56
  • 1
    You already asked this question. Why are you asking the exact same question again? Now, as far as your questions *a)* through *d)*, those are valid questions and can be asked independently with objective answers. And you can find answers to *a* and *c* in other questions. – David Makogon Jul 09 '13 at 03:58
  • Really? Are you going to point me to those questions, then? I've probably seen them already, and in fact they do not *actually* answer my questions. But if you think otherwise, post the links. Why did I ask the question again? Because someone put it 'on hold'. This seemed frankly quite irritating to me, and my response was to simply repost it. I am just tired of people policing my questions but not answering them in a useful way. – Tom Jul 09 '13 at 04:26
  • A simple search of `[Azure] network bandwidth` would have returned several answers including [this one](http://stackoverflow.com/questions/17303044/windows-azure-virtual-machine-is-slow-to-access-network-when-scaling/17304384#17304384). NIC performance is for everything a VM does, including talking between VMs. I'll let you look up the rest. Please post additional questions for ones you are trying to get answered (in other words, don't ask for answers within comments - no way to upvote/downvote/favorite/etc.) – David Makogon Jul 09 '13 at 16:30
  • Yes that is one of the questions I had seen already. My confusion was about how these VM sizes relate to cloud services and websites, which did not seem to have a size associated with them. – Tom Jul 09 '13 at 19:25
  • I have decided to go with option d) for the time being; do you have any comments on this architecture? – Tom Jul 09 '13 at 19:28

1 Answer


I would suggest a workflow similar to:

  • Client uploads data directly to Blob storage (in smaller chunks as per this guide)
  • When the upload is finished, the client notifies your web service, and the web service posts a message on a Service Bus queue (jobQueue). The message contains a unique session identifier and the blob URL of the uploaded data. The web service then blocks and listens on another Service Bus queue (replyQueue) for the reply message with the specified sessionId (a minimal sketch of this request/reply flow follows the list).
  • A multi-threaded worker role long-polls the Service Bus jobQueue; each message it receives is processed, the processed data is stored somewhere, and a reply message is then posted to the replyQueue with the sessionId set.
  • The web service will then receive the reply message (for the given sessionId) and can return a result to your client.
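Here is a minimal sketch of the web-service side of that request/reply pattern, using the Service Bus brokered messaging API. The queue names, the session-enabled replyQueue, and the `BlobUrl`/`ResultBlobUrl` properties are assumptions used to illustrate the flow, not a definitive implementation:

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

public class JobDispatcher
{
    private readonly QueueClient _jobQueue;
    private readonly QueueClient _replyQueue;   // assumed to be created with sessions enabled

    public JobDispatcher(string connectionString)
    {
        _jobQueue = QueueClient.CreateFromConnectionString(connectionString, "jobQueue");
        _replyQueue = QueueClient.CreateFromConnectionString(connectionString, "replyQueue");
    }

    // Called by the web service once the client reports the upload is complete.
    public string ProcessBlob(string blobUrl)
    {
        var sessionId = Guid.NewGuid().ToString();

        // Post the job: the worker learns where the data lives and which session to reply to.
        var job = new BrokeredMessage();
        job.Properties["BlobUrl"] = blobUrl;
        job.ReplyToSessionId = sessionId;
        _jobQueue.Send(job);

        // Block until the worker posts a reply message with the matching SessionId.
        // (Timeout and null handling omitted for brevity.)
        var session = _replyQueue.AcceptMessageSession(sessionId, TimeSpan.FromMinutes(5));
        var reply = session.Receive(TimeSpan.FromMinutes(5));
        reply.Complete();
        session.Close();

        return (string)reply.Properties["ResultBlobUrl"];   // e.g. where the worker stored the result
    }
}
```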

With an architecture similar to this you can scale vertically by using a bigger machine for your worker role, or horizontally by adding more instances of the worker role.

To make the process a bit more resilient, you may want to return to the client immediately after it has notified the web service of the uploaded data, and then signal the client directly from the worker role using SignalR once the data has been processed.
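A sketch of how that signalling could look, assuming the web role hosts a NotificationHub, the browser subscribes with its sessionId, and the worker role uses the SignalR .NET client to call back into it (names are illustrative):

```csharp
using Microsoft.AspNet.SignalR;
using Microsoft.AspNet.SignalR.Client;

// Hosted by the web role: browsers join a group named after their sessionId
// and wait for the "jobCompleted" callback.
public class NotificationHub : Hub
{
    public void Subscribe(string sessionId)
    {
        Groups.Add(Context.ConnectionId, sessionId);
    }

    // Invoked by the worker role (via the SignalR .NET client) when processing finishes.
    public void JobCompleted(string sessionId, string resultBlobUrl)
    {
        Clients.Group(sessionId).jobCompleted(resultBlobUrl);
    }
}

// Worker role side: notify the hub once the data has been processed.
public static class WorkerNotifier
{
    public static void NotifyCompleted(string webRoleUrl, string sessionId, string resultBlobUrl)
    {
        var connection = new HubConnection(webRoleUrl);          // e.g. "http://myapp.cloudapp.net/"
        var hub = connection.CreateHubProxy("NotificationHub");
        connection.Start().Wait();
        hub.Invoke("JobCompleted", sessionId, resultBlobUrl).Wait();
        connection.Stop();
    }
}
```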

My answers to the other parts of your question are:

a) I'm unsure what the guarantees of data transfer speeds are between the roles

b) Yes, you can upload the data directly to Blob storage and then transfer it from the blob to the worker role

c) You can run worker-role-style processing on the web role: call your worker-style code from WebRole.OnStart and WebRole.Run (see the sketch below); then, as you need to scale, that code can be moved to its own dedicated worker role
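For (c), a minimal sketch of what that entry point looks like; the loop body is a placeholder for your existing worker code:

```csharp
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

// Role entry point of the *web* role project. OnStart/Run are the same hooks a
// worker role uses, so worker-style code can live here until it is moved out.
// Note: with full-IIS web roles this code runs in a separate host process from
// the MVC app, so it cannot share in-memory state with your controllers.
public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // One-time setup (e.g. create queues, blob containers) would go here.
        return base.OnStart();
    }

    public override void Run()
    {
        while (true)
        {
            // Placeholder for the worker-style processing loop
            // (e.g. poll the jobQueue and process messages).
            Thread.Sleep(1000);
        }
    }
}
```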

Rob