
I'm trying to optimize the performance of writing a lot of small files to an SSD.

    // cb holds [path, contents] pairs; it is populated elsewhere (80048 entries in my case)
    ConcurrentBag<string[]> cb = new ConcurrentBag<string[]>();
    cb.AsParallel().ForAll(fa => File.WriteAllText(fa[0], fa[1]));

The ConcurrentBag<string[]> contains 80048 entries, and cb.Sum(gbc => Encoding.UTF8.GetByteCount(gbc[1])) returns 393441217 bytes.

Somewhere else I do an xml.Save(), which creates a ~750 MB file.

The first situation takes 3 minutes and 30 seconds to complete. The second 20 seconds.

I understand there is some overhead in handling all the separate write operations, but 3 minutes and 30 seconds still seems a bit long. I already tried parallelization with ForAll, which helped quite a bit (before that it took between 6 and 8 minutes to complete). What other modifications could I make to my code to optimize the performance of the bulk file creation?

BigChief
  • You can write to disk concurrently from multiple threads, but the data is still written to the disk in a single stream; I didn't realize this at the time of writing this question – BigChief Dec 04 '21 at 13:43

2 Answers


Actually, multiple simultaneous IO operations can slow things down quite a lot, especially on traditional disks. I recommend using ConcurrentQueue for writing multiple files.

You could also switch to StreamWriter and control the buffer size to increase write speed:

    ConcurrentQueue<string[]> concurrentQueue = new ConcurrentQueue<string[]>();

    // populate with some data
    for (int i = 0; i < 5000; i++)
    {
        concurrentQueue.Enqueue(new string[] { Guid.NewGuid().ToString(), Guid.NewGuid().ToString() });
    }

    // drain the queue until it is empty
    string[] currentElement;
    while (concurrentQueue.TryDequeue(out currentElement))
    {
        const int BufferSize = 65536;  // change it to your needs
        using (var sw = new StreamWriter(currentElement[0], true, Encoding.UTF8, BufferSize))
        {
            sw.Write(currentElement[1]);
        }
    }
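If you still want some parallelism on top of the queue, here is a minimal sketch (untested; the worker count is an arbitrary example value) that lets a few tasks drain the same queue:

    // Sketch only: a few worker tasks drain the shared queue concurrently.
    // Requires System.Linq, System.Threading.Tasks, System.IO and System.Text.
    int workerCount = 4;  // arbitrary example value, tune for your SSD
    Task[] workers = Enumerable.Range(0, workerCount)
        .Select(_ => Task.Run(() =>
        {
            string[] item;
            while (concurrentQueue.TryDequeue(out item))
            {
                const int BufferSize = 65536;
                using (var sw = new StreamWriter(item[0], true, Encoding.UTF8, BufferSize))
                {
                    sw.Write(item[1]);
                }
            }
        }))
        .ToArray();
    Task.WaitAll(workers);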
Daniel Luberda
  • How do I determine an appropriate buffer size? This is not a traditional disk but an SSD, should this be determined first... Do you have an example of ConcurrentQueue with File.WriteAllText or StreamWriter? Also I tried with buffer size = 16384 (4 * default) but this caused an OutOfMemoryException – BigChief Aug 04 '15 at 15:38
  • I still encourage you to switch to ConcurrentQueue; I'll update my answer with an example. SSDs are also affected (but a lot less). See https://technet.microsoft.com/en-us/library/cc938632.aspx and http://stackoverflow.com/questions/8803515/optimal-buffer-size-for-write2 – Daniel Luberda Aug 04 '15 at 15:47

You should also try using Parallel.ForEach instead of ForAll. You can find some good reasons in the post http://reedcopsey.com/2010/02/03/parallelism-in-net-part-8-plinqs-forall-method/

The post's guideline is:

The ForAll extension method should only be used to process the results of a parallel query, as returned by a PLINQ expression
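For illustration, a minimal sketch of that change, assuming cb is the ConcurrentBag<string[]> from the question (the MaxDegreeOfParallelism value is just an example, not a recommendation):

    // Hypothetical sketch: iterate the bag with Parallel.ForEach instead of AsParallel().ForAll().
    // Assumes each entry is a [path, contents] pair as in the question.
    Parallel.ForEach(
        cb,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },  // example value only
        fa => File.WriteAllText(fa[0], fa[1]));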

silver