9

Say the method below is being called several thousand times by different threads in a .NET 4 application. What's the best way to handle this situation? I understand that the disk is the bottleneck here, but I'd like the WriteFile() method to return quickly.

The data can be up to a few MB. Are we talking thread pool, TPL, or the like?

public void WriteFile(string FileName, MemoryStream Data)
{
   try
   {
      using (FileStream DiskFile = File.OpenWrite(FileName))
      {
         Data.WriteTo(DiskFile);
         DiskFile.Flush();
         DiskFile.Close();
      }
   }
   catch (Exception e)
   {
      Console.WriteLine(e.Message);
   }
}
SwDevMan81
Canacourse

4 Answers

6

If you want to return quickly and don't really need the operation to be synchronous, you could create some kind of in-memory queue for write requests: while the queue isn't full, WriteFile can enqueue and return quickly, and another thread is responsible for draining the queue and writing the files. If WriteFile is called while the queue is full, you'll have to wait until there is room, and execution becomes synchronous again. But with a big buffer, if the file-write requests don't arrive at a steady rate but come in spikes (with pauses between spikes), this change can show up as a real performance improvement.

UPDATE: I made a little picture for you. Notice that the bottleneck always exists; all you can possibly do is optimize requests by using a queue. Notice too that the queue has limits, so when it fills up you cannot instantly enqueue more files; you have to wait for free space in that buffer as well. But for the situation shown in the picture (3 bucket requests), it's obvious you can quickly put the buckets into the queue and return, while in the first case you have to handle them one by one and block execution.

Notice that you never need many I/O threads running at once, since they will all be contending for the same bottleneck; you'll just waste memory if you try to parallelize this heavily. I believe 2-10 threads at most will easily take all the available I/O bandwidth, and will limit the application's memory usage too.

[Diagram: three write requests handled one by one through the disk bottleneck vs. quickly placed into a bounded queue, letting the caller return immediately]
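A bounded queue along these lines can be sketched with .NET 4's `BlockingCollection`. This is an illustrative sketch, not code from the question; the capacity of 64 and the class name are assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

public sealed class QueuedFileWriter : IDisposable
{
    // Bounded queue: Add() blocks (execution becomes synchronous again)
    // once 64 requests are pending.
    private readonly BlockingCollection<Tuple<string, byte[]>> queue =
        new BlockingCollection<Tuple<string, byte[]>>(64);
    private readonly Task consumer;

    public QueuedFileWriter()
    {
        // A single dispatcher thread drains the queue and does the disk writes.
        consumer = Task.Factory.StartNew(() =>
        {
            foreach (var item in queue.GetConsumingEnumerable())
                File.WriteAllBytes(item.Item1, item.Item2);
        }, TaskCreationOptions.LongRunning);
    }

    // Returns quickly while the queue has free space.
    public void WriteFile(string fileName, byte[] data)
    {
        queue.Add(Tuple.Create(fileName, data));
    }

    public void Dispose()
    {
        queue.CompleteAdding();  // let the consumer finish pending writes
        consumer.Wait();
    }
}
```

The bounded capacity is the flow control discussed above: when producers outrun the disk, `Add` blocks instead of letting memory grow without limit.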

Valentin Kuzub
  • Tried something like that. I placed a ConcurrentBag in front of WriteFile, but the files come to me from a 3rd-party callback, so the ConcurrentBag ended up with tons of files and never had a chance to empty before more files came in. – Canacourse Sep 14 '11 at 19:37
  • 2
    Well thats why you need to check whether queue is full before you can return. Setup some maximum size for queue, and check whether its filled less than that. If your writing speed is 10mb sec and your incoming requests are so high that to store them you will need 1GB sec theres nothing you can do here, without serious hardware changes. – Valentin Kuzub Sep 14 '11 at 19:59
3

Since you say the files need not be written in order or immediately, the simplest approach would be to use a Task:

private void WriteFileAsynchronously(string FileName, MemoryStream Data)
{
    Task.Factory.StartNew(() => WriteFileSynchronously(FileName, Data));
}

private void WriteFileSynchronously(string FileName, MemoryStream Data)
{
    try
    {
        using (FileStream DiskFile = File.OpenWrite(FileName))
        {
            Data.WriteTo(DiskFile);
            DiskFile.Flush();
            DiskFile.Close();
        }
    }

    catch (Exception e)
    {
        Console.WriteLine(e.Message);
    }
}

The TPL uses the thread pool internally, and should be fairly efficient even for large numbers of tasks.
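If you also want to cap how much work can be in flight at once (the unbounded-queue concern raised in the comments on the other answer), one hedged variation is to gate StartNew behind a semaphore. This is a sketch, not Cameron's code; the limit of 8 and the class name are illustrative:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class ThrottledWriter
{
    // Illustrative cap: at most 8 writes queued or running at any moment.
    private readonly SemaphoreSlim pending = new SemaphoreSlim(8);

    public void WriteFileAsynchronously(string fileName, byte[] data)
    {
        // Blocks the caller only when 8 writes are already in flight,
        // so memory use stays bounded under sustained load.
        pending.Wait();
        Task.Factory.StartNew(() =>
        {
            try
            {
                File.WriteAllBytes(fileName, data);
            }
            finally
            {
                pending.Release();
            }
        });
    }
}
```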

Cameron
  • @Canacourse: Nope! The thread pool [limits the number of actual threads](http://msdn.microsoft.com/en-us/library/system.threading.threadpool.getmaxthreads.aspx) carrying out the work. See [this blog post](http://blogs.msdn.com/b/jennifer/archive/2009/06/26/work-stealing-in-net-4-0.aspx) on work-stealing queues for a nice high-level explanation of how tasks are implemented – Cameron Sep 14 '11 at 19:41
  • This approach has its limitations, since you aren't trying to control the flow at all. Imagine the disk write speed is 1 KB per second (to see the problem more clearly). If you're getting write requests from the web at megabytes per second, your application will blow up quickly. – Valentin Kuzub Sep 14 '11 at 20:13
  • 1
    @Valentin: Good point. If that's the case for the OP, then my solution is terrible! If the queues consistently fill up more quickly than they can be emptied, then at some point we'd have to stop filling the queues and wait for them to drain a bit. Your answer is much better in such a case (mine is just simple). Nice diagram, btw :-) – Cameron Sep 14 '11 at 20:38
  • Oops, just discovered I had not accepted an answer. It's interesting that .NET 4.5 will have asynchronous file I/O built in. – Canacourse Apr 25 '12 at 09:08
  • I suppose you meant to name the first function differently, perhaps `WriteFileAsynchronous`? – 2i3r Apr 13 '22 at 07:37
  • It would appear so! – Cameron Apr 14 '22 at 00:07
2

If data is coming in faster than you can log it, you have a real problem. A producer/consumer design where WriteFile just throws work into a ConcurrentQueue or similar structure, with a separate thread servicing that queue, works great ... until the queue fills up. And if you're talking about opening 50,000 different files, things are going to back up quickly. Not to mention that data that can be several megabytes per file will further limit the size of your queue.

I've had a similar problem, which I solved by having the WriteFile method append to a single file. The records it wrote had a record number, file name, length, and then the data. As Hans pointed out in a comment on your original question, writing to a file is quick; opening a file is slow.

A second thread in my program starts reading the file that WriteFile is writing to. That thread reads each record header (number, file name, length), opens a new file, and then copies the data from the log file into the final file.

This works better if the log file and the final files are on different disks, but it can still work well with a single spindle. It sure exercises your hard drive, though.

It has the drawback of requiring 2X the disk space, but with 2-terabyte drives under $150, I don't consider that much of a problem. It's also less efficient overall than directly writing the data (because you have to handle the data twice), but it has the benefit of not causing the main processing thread to stall.
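A minimal sketch of the append-only log described above. The field layout (record number, name, length, payload) follows the description, but the class and method names are assumptions, not Jim's actual code:

```csharp
using System;
using System.IO;

public sealed class WriteLog : IDisposable
{
    private readonly BinaryWriter writer;
    private readonly object sync = new object();
    private int recordNumber;

    public WriteLog(string logPath)
    {
        // The log is opened once; every WriteFile call is then a cheap append,
        // avoiding the cost of opening a new file per request.
        writer = new BinaryWriter(new FileStream(logPath, FileMode.Append, FileAccess.Write));
    }

    public void Append(string fileName, byte[] data)
    {
        lock (sync)
        {
            writer.Write(recordNumber++); // record number
            writer.Write(fileName);       // length-prefixed file name
            writer.Write(data.Length);    // payload length
            writer.Write(data);           // payload
        }
    }

    public void Dispose()
    {
        writer.Close();
    }

    // Reader side (run on the second thread): walk the log sequentially
    // and create each final file.
    public static void Drain(string logPath)
    {
        using (var reader = new BinaryReader(File.OpenRead(logPath)))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                reader.ReadInt32();                    // record number (unused here)
                string fileName = reader.ReadString();
                int length = reader.ReadInt32();
                File.WriteAllBytes(fileName, reader.ReadBytes(length));
            }
        }
    }
}
```

In a real deployment the reader would run concurrently and track its position in the log rather than draining it after the fact; this sketch only shows the record format.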

Jim Mischel
  • Well, if I/O access is the bottleneck, then you're suggesting two writers working at the same time, so the bottleneck becomes twice as small for writing. One thread is writing to disk non-stop; another is reading AND writing somewhere else. If the queue filled up in time X before, now it fills up much faster; it's not even X/2. Imagine you're getting 1 GB files incoming at a speed equal to your disk's write speed. Your solution will simply not work, unlike mine, which will use the whole available disk write speed, not half of it or even less. – Valentin Kuzub Sep 15 '11 at 02:15
  • For this sample case, in order to even let the first thread do its job of writing the raw data (which won't get us anywhere near the end result of having the actual files written to disk), you would have to completely disable the second thread, or you get a jam that will blow up memory quickly. – Valentin Kuzub Sep 15 '11 at 02:20
  • @Valentin: You're right in that the overall performance of the system will be lower with my approach. It will take more time. But it appears from the OP's question that the problem is keeping the workers from stalling. My solution does that because the `WriteFile` method just appends to a file. That is a very fast operation. The bottleneck is creating new files, which is handled by the separate thread. There will be some IO contention, yes. But the worker threads won't block. I have written and do use something that's very similar to what I described. And it does work as advertised. – Jim Mischel Sep 15 '11 at 15:00
  • Well, if we're talking abstract figures, yes, opening is more expensive, but that's not related to maximum I/O bandwidth. Maximum I/O bandwidth is fixed, and if our process is already optimized to write at maximum speed (by means of any approach; for simplicity, say the incoming files are generally huge, so it's mostly writing), introducing a second thread that reads and writes can be a performance killer. Given incoming speed X and out-to-disk speed Y, while Y - X > 0 we're doing all right, but once Y - X drops below zero by a single byte per second, the application is doomed to crash. – Valentin Kuzub Sep 15 '11 at 20:45
  • @Valentin: Undoubtedly there is a limit where this will fail. But it should work fine for the OP's application. As long as the OS's write cache is large enough to hold the data that comes in while the other thread is reading and writing a single file, things work just fine. The write cache will buffer, and then the OS will flush to the file in one big write. Assuming, of course, that there is a point when data stops coming in, or at least slows enough that the write-behind thread can catch up. But that assumption is built into any caching scheme. – Jim Mischel Sep 15 '11 at 21:49
0

Encapsulate your complete method implementation in a new Thread(). Then you can fire and forget these threads and return to the main calling thread.

    foreach (var file in filesArray)
    {
        try
        {
            // Note: this try/catch only covers creating and starting the thread;
            // exceptions thrown inside the delegate must be handled within
            // WriteFileSynchronous itself.
            System.Threading.Thread updateThread = new System.Threading.Thread(delegate()
                {
                    WriteFileSynchronous(fileName, data); // fileName/data taken from 'file'
                });
            updateThread.Start();
        }
        catch (Exception ex)
        {
            string errMsg = ex.Message;
            Exception innerEx = ex.InnerException;
            while (innerEx != null)
            {
                errMsg += "\n" + innerEx.Message;
                innerEx = innerEx.InnerException;
            }
            errorMessages.Add(errMsg);
        }
    }
Leon
  • 2
    fire & forget & crash application because you dont control number of those concurrent threads anyhow? – Valentin Kuzub Sep 14 '11 at 20:00
  • @Valentin Kuzub: This has worked beautifully in my experience: 20, 100, 150 threads at any one time and it works just fine. I have used it for data processing and heavy web service calls. Look at Thread.Join() if you want to fire off multiple threads concurrently and then wait for all of them to come back before continuing. – Leon Sep 14 '11 at 20:13
  • Well, I point out a clear bug in your approach and you say it's beautiful. I have nothing to add. – Valentin Kuzub Sep 14 '11 at 20:15
  • Where is the bug? You can throttle how many threads are created if you anticipate heavy load. – Leon Sep 14 '11 at 20:17
  • The app in question can contain 50,000+ files, most of which have to be committed to disk. – Canacourse Sep 14 '11 at 20:18
  • Well, you say 150 threads; how about 10,000 threads with 1 KB files to write? Or more? Do you still consider it beautiful? – Valentin Kuzub Sep 14 '11 at 20:19
  • Canacourse: you can do it in "batches". Each batch will be sync, but within each batch it will be async. – Leon Sep 14 '11 at 20:19
  • Also, Leon, the original method doesn't present you with an array of files. You get a lot of calls to the method, each with a single file, so you'd have to group them into chunks first for your approach to work, and that raises new questions. Say you plan to write chunks of 50 files and you've received 30 requests. You still don't write anything? What if users expect to get their files into the file system not immediately, but at SOME POINT after sending a request? Here, if more requests don't come in, they will never see their files flushed to disk. – Valentin Kuzub Sep 14 '11 at 20:23
  • This is called a processing application: some data comes in, something happens to it, and it's put somewhere else. You can use a queue that gets filled by your calls; a timer reads from the queue and saves files asynchronously. You're right, I didn't provide a complete framework for building a processing service. – Leon Sep 14 '11 at 20:27
  • Aha, so now your solution sounds pretty much like mine ;) I just have to say it doesn't make much sense to run many file-writing threads, because it doesn't scale; a couple of threads, or maybe 3-4, will easily take all the available I/O bandwidth. 150 threads are no good for this task; you're just wasting 150 MB of memory on them. – Valentin Kuzub Sep 14 '11 at 20:29
  • Agreed, disk I/O doesn't scale well. At least on Windows, I/O access is "optimized" by both the HD cache and the OS. At my last employer we discovered that things get written quicker with a few (10 or fewer) concurrent threaded "writes" than one at a time. With more threads, the "speedup" disappears and eventually it becomes just like sync. – Leon Sep 14 '11 at 20:40