
I am writing a program which scours the entire filesystem of a computer in order to destroy any files that fall within certain parameters. I want the program to run as fast as possible and use as many resources as necessary to achieve this (it's worth noting that the user is not expected to be doing any other work while this process takes place). To that end, I've written a method which takes a target directory, searches all the files in it, then queues up a new task for each child directory. This is currently done by passing the directories' paths into a queue which the main thread monitors and uses to actually initialize the new tasks, like so:

static class DriveHandler
{
    internal static readonly List<string> fixedDrives = GetFixedDrives();

    private static readonly ConcurrentQueue<string> _targetPathQueue = new ConcurrentQueue<string>();
    private static int _threadCounter = 0;

    internal static void WipeDrives()
    {
        foreach (string driveLetter in fixedDrives)
        {
            Interlocked.Increment(ref _threadCounter);
            Task.Run(() => WalkDrive(driveLetter));
        }
        while (Volatile.Read(ref _threadCounter) > 0 || !_targetPathQueue.IsEmpty)
        {
            if (_targetPathQueue.TryDequeue(out string path))
            {
                Interlocked.Increment(ref _threadCounter);
                Task.Run(() => WalkDrive(path));
            }
        }
    }

    private static void WalkDrive(string directory)
    {
        foreach (string file in Directory.GetFiles(directory))
        {
            //If file meets conditions, delete
        }
        string[] subDirectories = Directory.GetDirectories(directory);
        if (subDirectories.Length != 0)
        {
            foreach (string subDirectory in subDirectories)
            {
                _targetPathQueue.Enqueue(subDirectory);
            }
        }
        else { } //do other stuff;
        Interlocked.Decrement(ref _threadCounter);
    }
}

My question is, is it safe/worth it to just initialize the new tasks from within the already running tasks to avoid wasting processor time monitoring the queue? Something that looks like this:

static class DriveHandler
{
    internal static readonly List<string> fixedDrives = GetFixedDrives();

    private static int _threadCounter = 0;

    internal static void WipeDrives()
    {
        foreach (string driveLetter in fixedDrives)
        {
            Interlocked.Increment(ref _threadCounter);
            Task.Run(() => WalkDrive(driveLetter));
        }
        while (Volatile.Read(ref _threadCounter) > 0)
        {
            Thread.Sleep(5000);
        }
    }

    private static void WalkDrive(string directory)
    {
        foreach (string file in Directory.GetFiles(directory))
        {
            //If file meets conditions, delete
        }
        string[] subDirectories = Directory.GetDirectories(directory);
        if (subDirectories.Length != 0)
        {
            foreach (string subDirectory in subDirectories)
            {
                Interlocked.Increment(ref _threadCounter);
                Task.Run(() => WalkDrive(subDirectory));
            }
        }
        else { } //do other stuff;
        Interlocked.Decrement(ref _threadCounter);
    }
}

I of course need every task to die once it's done. Will doing things this way make the old tasks parents of the new ones and keep them alive until all their children have finished?

Many thanks!

  • Everything you're doing is I/O-bound on the same device and you have many threads competing for that device's time, so it's likely that all this multi-threading is working against you, in terms of both code complexity and runtime performance. You would be better off (if for no other reason than for your own sanity) doing a single-threaded scan-and-delete using `async/await` throughout. – madreflection Apr 20 '22 at 20:16
  • Spoke too soon... there are no async file system methods yet. Being I/O-bound is still an issue, though, so throwing more threads at it will have vanishing returns. – madreflection Apr 20 '22 at 20:20
  • @madreflection It's worth noting that the parameters the program is checking against don't actually include the contents of the file, meaning that really the only reads I'm doing are getting the file structure and metadata. In tests so far, I haven't been able to saturate the I/O, as the program reaches the maximum number of tasks before that point (something like 19k). That being said, these tests have only been in a VM so far, so physical disks might be a whole different story. I will likely limit the number of tasks more though, so thanks for the input! – Cailean Parker Apr 20 '22 at 20:47
  • It matters a lot whether you're working with an SSD or spinning rust. SSDs love multiple queued commands. Spinning disks can only have the heads in one location at a time, so accessing the same disk (which often, but not always, maps 1:1 to a drive letter... partitions do exist) from multiple threads wastes a lot of time seeking and gets very little actual work done. – Ben Voigt Apr 20 '22 at 20:49
  • @BenVoigt I hadn't considered that. Unfortunately this program will run on a relatively wide variety of machines, some with SSDs and some without. I think the balance is more in favor of the SSDs however. That being said, if it were just running on HDDs, do you think multithreading this process would be altogether pointless? I figure a few extra threads at least could be useful to be accessing the drive while others run checks on the files. – Cailean Parker Apr 21 '22 at 18:02
  • @CaileanParker: Multithreading with spinning disks is useful when (a) you have multiple disks in the computer or (b) you have significant work apart from the disk access. In that case go ahead and spin up threads/tasks to do the work... they can be CPU-heavy, or do a lot of non-disk I/O (network transfers, for example). But to avoid thrashing the heads, only the single thread assigned per disk should actually do I/O to the disk. – Ben Voigt Apr 21 '22 at 18:09
  • @BenVoigt Makes sense. I checked with one of our SysAdmins and they confirmed that SSDs are more common, and trashing the HDDs isn't really a concern (this program is only meant to be run at a machine's end-of-life). Thanks again for the input, it'll certainly be applicable in many other programs moving forward! – Cailean Parker Apr 21 '22 at 18:28

1 Answer


First problem:

Task.Run(() => WalkDrive(path));

This is fire-and-forget, and it's not a good thing to do in this context. Why? Because chances are you have far more files and paths on the disk than the machine has CPU and memory capacity for (tasks consume memory as well, not just CPU). Fire and forget means exactly that: you keep spawning tasks without ever awaiting them.

My question is, is it safe/worth it to just initialize the new tasks from within the already running tasks to avoid wasting processor time monitoring the queue?

It's valid; nothing prevents you from doing that. But you are already wasting resources, so why spawn a new task each time? You already have one running: make it a long-running background task and keep it going, so there are just two threads, one for the caller (I assume the UI/user-facing thread) and one doing the work. All this locking and task spawning is going to hurt your performance and waste resources in both CPU and memory allocations.
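
For illustration only (this is not the asker's code; it borrows the fixedDrives list and the placeholder file check from the question, and the method name is made up, assuming the usual System.IO, System.Collections.Generic and System.Threading.Tasks usings), a single long-running worker could drain an explicit stack of directories itself, with no extra tasks, queues or counters at all:

internal static void WipeDrivesSingleWorker()
{
    // One long-running worker does the whole traversal; the caller just waits for it.
    Task worker = Task.Run(() =>
    {
        var pending = new Stack<string>(fixedDrives);

        while (pending.Count > 0)
        {
            string directory = pending.Pop();

            foreach (string file in Directory.GetFiles(directory))
            {
                //If file meets conditions, delete
            }

            foreach (string subDirectory in Directory.GetDirectories(directory))
            {
                pending.Push(subDirectory);
            }
        }
    });

    worker.Wait();
}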

If you want to speed things up with parallel execution, you could add the paths to a concurrent queue and run only 10-100 concurrent tasks max, or whatever limit you choose. At least then you have an upper bound, and you control how much work the code does in parallel.

While the concurrent queue is not empty and no one has requested cancellation of the operation:

  1. Start from the base path.
  2. Get all sub-paths and enqueue them in the concurrent queue.
  3. Process the files inside the current path.
  4. Set the current base path to the next item available in the queue.
  5. Start all over again.

You just start the maximum number of concurrent tasks and that's it.

Your main loop/while condition is something like:

// Shared work queue (System.Collections.Concurrent) that all workers drain
private static readonly ConcurrentQueue<string> pathQueue = new ConcurrentQueue<string>();

private async Task StartAsync(CancellationToken cancellationToken)
{
    var tasks = new List<Task>();

    for (int i = 0; i < MaxConcurrentTasks; i++)
    {
        tasks.Add(Task.Run(() => ProcessPath(initialPathHere, cancellationToken), cancellationToken));
    }

    await Task.WhenAll(tasks);
}

And then something along these lines:

private static void ProcessPath(string path, CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        // Enqueue every subdirectory so this or another worker can pick it up later
        foreach (var subPath in System.IO.Directory.EnumerateDirectories(path))
        {
            pathQueue.Enqueue(subPath);
        }

        // Once finished, process the files in the current path
        foreach (var file in System.IO.Directory.EnumerateFiles(path))
        {
            // If the file meets the conditions, delete it
        }

        // Move on to the next directory waiting in the queue
        if (!pathQueue.TryDequeue(out path))
        {
            break;
        }
    }
}

I haven't verified the syntax, but that's roughly how a good algorithm would do it in my opinion. Also, please keep in mind that when a task finishes its current directory, the queue might be momentarily empty at this line even though other tasks are still adding to it, so modify that code accordingly.

if (!pathQueue.TryDequeue(out path))
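
One way to close that gap, sticking with the Interlocked counter idea from the question, is to track how many directories are still in flight: increment a counter whenever a directory is enqueued, decrement it when a worker finishes one, and only let a worker exit once the queue is empty and the counter has dropped to zero. A rough sketch of the idea (pendingDirectories, ProcessQueue and the seeding step are my additions, not part of the code above; each worker simply pulls everything from the shared queue instead of being handed a starting path):

private static int pendingDirectories = 0;

// The caller seeds the queue once before starting the workers:
//     Interlocked.Increment(ref pendingDirectories);
//     pathQueue.Enqueue(rootPath); // rootPath is a placeholder for each drive root

private static void ProcessQueue(CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        if (pathQueue.TryDequeue(out string path))
        {
            foreach (var subPath in System.IO.Directory.EnumerateDirectories(path))
            {
                Interlocked.Increment(ref pendingDirectories);
                pathQueue.Enqueue(subPath);
            }

            foreach (var file in System.IO.Directory.EnumerateFiles(path))
            {
                // If the file meets the conditions, delete it
            }

            // This directory is done; its children were already counted when enqueued
            Interlocked.Decrement(ref pendingDirectories);
        }
        else if (Volatile.Read(ref pendingDirectories) == 0)
        {
            break; // Queue is empty and no worker is still producing, so we are done
        }
        else
        {
            Thread.Sleep(10); // Queue is momentarily empty; other workers are still walking
        }
    }
}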

Final notes:

  1. Consider the trade-offs between tasks and Parallel.Invoke/Parallel.ForEach (see the sketch after this list).
  2. Consider using a BackgroundService; they are fine-tuned for long-running work, depending on your code and requirements.
  3. To actually gain performance, remember the golden rule: measure early. Start by measuring and have some metrics at hand, so that when you later want to speed things up you at least know where you stand; then refactor, measure again, and compare, and you will know whether you are getting closer to or further from your target.
  4. Make sure you get concurrency/parallel processing right, otherwise it is going to work against you, not for you.
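
As a rough illustration of note 1 (not a drop-in replacement; ProcessDrive is a made-up name, MaxConcurrentTasks is the same placeholder constant as above, and the file check is left as a stub), Parallel.ForEach can give you the bounded degree of parallelism without hand-rolling the task loop:

private static void ProcessDrive(string driveRoot)
{
    var options = new ParallelOptions { MaxDegreeOfParallelism = MaxConcurrentTasks };

    // Fan the subdirectories out over a bounded number of workers
    Parallel.ForEach(
        System.IO.Directory.EnumerateDirectories(driveRoot, "*", System.IO.SearchOption.AllDirectories),
        options,
        directory =>
        {
            foreach (string file in System.IO.Directory.GetFiles(directory))
            {
                // If file meets conditions, delete
            }
        });

    // EnumerateDirectories does not return the root itself, so handle its files separately
    foreach (string file in System.IO.Directory.GetFiles(driveRoot))
    {
        // If file meets conditions, delete
    }
}

Keep in mind the caveat from the comments, though: on a single spinning disk, a high degree of parallelism mostly wastes time seeking.
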
Rickless
  • Thanks so much for the suggestions. I'm going to implement your algorithm. Just to explain my earlier thought process, my understanding was that Tasks are really just an abstraction over `ThreadPool.QueueUserWorkItem`, meaning that the overhead of creating them was fairly minimal, since it just grabs threads that already exist and already has an inherent limit. It makes sense, though, that minimal overhead times 19k still means a lot of overhead, so I like the idea of using a more limited pool of long-running threads. – Cailean Parker Apr 21 '22 at 19:20