
I have the following algorithm that writes data to Azure blob storage:

```csharp
private const long MaxChunkSize = 1024 * 1024 * 4; // 4 MB

private void UploadPagedDataToBlob(...)
{
    ...
    List<Task> list_of_tasks = new List<Task>();
    do
    {
        var stream = new MemoryStream(data, index, (int)blockSize);
        var task = _blob.WritePagesAsync(stream, startPosition, null);

        list_of_tasks.Add(task);
        ...
    }
    while (remainingDataLength > 0);
    Task.WaitAll(list_of_tasks.ToArray());
}
```

If my file is 628 MB, then `list_of_tasks` holds 157 tasks (628 MB / MaxChunkSize). Usually my files are larger than 1 TB. I don't want that many tasks running at once — how can I make this algorithm more efficient, and what is an optimal number of concurrent tasks? For example, no more than 200? Any recommendations?

Anatoly
  • It depends. A remote system may limit connections, on a single CPU you may want to limit it to cores if it's CPU bound. We don't know what '_blob' is so it's hard to answer. In general you'd be better off using Parallel.For or TPL DataFlow and let TPL decide how many tasks to run at once. – Ian Mercer Apr 25 '16 at 15:21
  • I answered a similar question some time back. It may be helpful: http://stackoverflow.com/a/32252521/1835769 – displayName Apr 25 '16 at 15:21
    You're the one who can do the experiment to determine the optimal number of tasks for your scenario, not us. Design an experiment, carefully perform it, and you will know the answer. – Eric Lippert Apr 25 '16 at 15:31
  • This is really broad and doesn't have a specific answer. But I am curious though: Why are you uploading to page blobs vs block blobs? – David Makogon Apr 25 '16 at 16:25
  • @DavidMakogon The VHD must be stored as a page blob. – Anatoly Apr 26 '16 at 09:06

1 Answer


Are you writing files to the same disk sequentially?

Parallelism is only useful if you can actually run the tasks in parallel. If your shared bottleneck is disk access, that's not going to get any better when you issue multiple writes at the same time — rather, it might get much slower, and it will tend to fight for priority with other things running on the same system.

Hard drives are pretty well optimized for sequential writing. If you're having throughput issues, just make your chunks bigger - but doing the writes in parallel is most likely going to hurt you rather than help.

If you're dealing with remote resources, you need to factor in the latency. If the latency is much higher than the time it takes to send one chunk, parallelising might be worthwhile to reduce "wasted" time - however, you also need to make sure everything is properly synchronized, and that there's no throttling that would hurt you.
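For the remote case, a common way to cap the number of in-flight uploads is to gate task creation with a `SemaphoreSlim`. The sketch below is an illustration, not the OP's actual code: the chunk bookkeeping (`index`, `startPosition`, remaining length) is reconstructed from the snippet in the question, and `maxParallelUploads` is a hypothetical tuning knob you would determine by experiment.

```csharp
// Sketch: cap concurrent WritePagesAsync calls with a SemaphoreSlim.
// maxParallelUploads is an assumed tuning value -- benchmark to find
// what your network and the storage account can actually sustain.
private const long MaxChunkSize = 1024 * 1024 * 4; // 4 MB

private async Task UploadPagedDataToBlobAsync(byte[] data, int maxParallelUploads)
{
    using (var gate = new SemaphoreSlim(maxParallelUploads))
    {
        var tasks = new List<Task>();
        int index = 0;
        long startPosition = 0;
        long remaining = data.LongLength;

        while (remaining > 0)
        {
            long blockSize = Math.Min(MaxChunkSize, remaining);

            // Blocks here once maxParallelUploads writes are in flight,
            // so the loop itself throttles task creation.
            await gate.WaitAsync();

            var stream = new MemoryStream(data, index, (int)blockSize);
            long position = startPosition;
            tasks.Add(Task.Run(async () =>
            {
                try { await _blob.WritePagesAsync(stream, position, null); }
                finally { gate.Release(); }
            }));

            index += (int)blockSize;
            startPosition += blockSize;
            remaining -= blockSize;
        }

        await Task.WhenAll(tasks);
    }
}
```

With this pattern, memory use and connection count stay bounded regardless of file size, because the loop pauses whenever the cap is reached instead of queuing hundreds of thousands of tasks up front.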

Luaan
  • Not sure how this is an accepted answer, as the question isn't about writing to disk. It's about writing to Azure blob storage — more specifically, to a page blob (given the code calls `WritePagesAsync()`). Azure blob storage is designed for multiple simultaneous writes and is not optimized like a hard drive. If the OP is writing to multiple blobs simultaneously, throughput is limited by per-blob and per-storage-account transactions per second (plus bandwidth). – David Makogon Apr 25 '16 at 16:22
  • Thanks, but I'm writing to one blob at different offsets, asynchronously. So how many tasks can I create? – Anatoly Apr 26 '16 at 09:09
  • @Anatoly That's the "remote resources" part of my answer — figure out the latency, see if there's any throttling involved, and ultimately, just try different configurations and pick the best. – Luaan Apr 26 '16 at 10:53