
I am tasked with updating a C# application (non-GUI) that is very single-threaded in its operation, and adding multi-threading to it to turn its queues of work over more quickly.

Each thread will need to perform a very minimal amount of calculation, but most of the work will be calling and waiting on SQL Server requests. So, lots of waiting compared to CPU time.

A couple of requirements will be:

  • Running on some limited hardware (that is, just a couple of cores). The current system, when it's being "pushed", only takes about 25% CPU. But, since it's mostly waiting for the SQL Server (a different server) to respond, we would like the capability to have more threads than cores.
  • Be able to limit the number of threads. I can't just have an unlimited number of threads going. I don't mind doing the limiting myself via an Array, List, etc.
  • Be able to keep track of when these threads complete so that I can do some post-processing.

It just seems to me that the .NET Framework has so many different ways of doing threads that I'm not sure if one is better than another for this task. I'm not sure if I should be using Task, Thread, ThreadPool, or something else... It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.

Jim
    Can you please post the current code? – Enigmativity Aug 19 '20 at 22:15
  • @Enigmativity The current code is quite a large project, I wouldn't even know where to start to carve it up for even some snips. That's why I was asking a high-level question to try and get back a high-level answer. Sorry. – Jim Aug 19 '20 at 22:29
  • Then you should try to give some examples signatures of the calls that you're making at least. The tool to do the job need to match. It's a bit like you've asked us how to cut some wood - there are a million tools for cutting wood - but do you need a jigsaw or an axe? – Enigmativity Aug 20 '20 at 00:27
  • How are the queues of work managed in your current app? How does work get triggered? What's happening with the results? – Enigmativity Aug 20 '20 at 00:37

4 Answers


I'm not sure if I should be using Task, Thread, ThreadPool, something else...

In your case it matters less than you would think. You can focus on what fits your (existing) code style and dataflow the best.

since it's mostly doing waits for the SQL Server to respond

Your main goal would be to get as many of those SQL queries going in parallel as possible.

Be able to limit the number of threads.

Don't worry about that too much. On 4 cores, at 25% CPU, you can easily have 100 threads going (more on 64-bit). But you don't want thousands of threads: each .NET Thread reserves at least 1 MB for its stack, so estimate how much RAM you can spare.

So it depends on your application how many queries you can get running at the same time. Worry about thread-safety first.

When the number of parallel queries is > 1000, you will need async/await to run them on fewer threads.

As long as it is < 100, just let threads block on I/O. Parallel.ForEach(), Parallel.Invoke(), etc. look like good tools.
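As a minimal sketch of the blocking approach, assuming a synchronous `ProcessFile` method that does the SQL work for one item (`GetPendingFilePaths`, `ProcessFile`, and `PostProcess` are all placeholder names, not from the question):

```csharp
using System.Threading.Tasks;

var filePaths = GetPendingFilePaths(); // hypothetical: the queued work items

Parallel.ForEach(filePaths, new ParallelOptions
{
    MaxDegreeOfParallelism = 32 // cap the number of concurrent (mostly blocked) threads
},
filePath =>
{
    ProcessFile(filePath); // synchronous: short CPU burst, long SQL wait
});

// Parallel.ForEach returns only when every item has been processed,
// so post-processing can simply run on the next line.
PostProcess();
```

MaxDegreeOfParallelism is the knob that satisfies the "limit the number of threads" requirement without managing an Array or List of threads yourself.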

The 100 - 1000 range is the grey area.

H H

add multi-threading to it to get it to turn queues of work over quicker.

Each thread will need to perform a very minimal amount of calculation, but most of the work will be calling and waiting on SQL Server requests. So, lots of waiting compared to CPU time.

With that kind of processing, it's not clear how multithreading will benefit you. Multithreading is one form of concurrency, and since your workload is primarily I/O-bound, asynchrony (and not multithreading) would be the first thing to consider.

It just seems to me that the .NET Framework has so many different ways of doing threads, I'm not sure if one is better than the other for this task.

Indeed. For reference, Thread and ThreadPool are pretty much legacy these days; there are much better higher-level APIs. Using Task directly as a delegate task (e.g., Task.Factory.StartNew) should also be rare.

It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.

await waits on one task at a time, yes. Task.WhenAll can be used to combine multiple tasks into one, and then you can await the combined task.
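A short sketch of that pattern (`ProcessFileAsync`, `filePaths`, and `PostProcess` are hypothetical names standing in for your own code):

```csharp
using System.Linq;
using System.Threading.Tasks;

// Start the asynchronous work for every file; nothing blocks here.
var tasks = filePaths.Select(path => ProcessFileAsync(path)).ToList();

// Await the combined task; it completes when all of the files are done,
// and any exceptions from the individual tasks surface here.
await Task.WhenAll(tasks);

PostProcess(); // all work is finished at this point
```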

get it to turn queues of work over quicker.

Be able to limit the number of threads.

Be able to keep track of when these threads complete so that I can do some post-processing.

It sounds to me that TPL Dataflow would be the best approach for your system. Dataflow allows you to define a "pipeline" through which data flows, with some steps being asynchronous (e.g., querying SQL Server) and other steps being parallel (e.g., data processing).

I was asking a high-level question to try and get back a high-level answer.

You may be interested in my book.

Stephen Cleary
  • "With that kind of processing". Users will upload a number of items (data files) to be processed and right now the process is "single threaded". One file gets processed, then the next, then the next... If we have 15 users that uploaded around the same time, and each file takes about 1 minute to process, then whoever was last to upload will wait around 15 minutes. Since a lot of that 1 minute is waiting for SQL in various calls, we would like to process multiple files at the same time, so that the last uploader gets done sooner. – Jim Aug 19 '20 at 23:04

It appears to me that the async/await model would not be a good fit in this case, though, as it waits on one specific task to complete.

That is wrong. Async/await is just syntax that simplifies the state-machine mechanism behind asynchronous code. It waits without consuming any thread; in other words, the async keyword does not create a thread, and await does not hold up any thread.

Be able to limit the number of threads

See: How to limit the amount of concurrent async I/O operations?
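The usual technique from that question is a SemaphoreSlim used as a throttle. A sketch, where `ProcessFileAsync` and `filePaths` are placeholders for your SQL-bound work:

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var throttler = new SemaphoreSlim(initialCount: 20); // at most 20 operations in flight

var tasks = filePaths.Select(async path =>
{
    await throttler.WaitAsync(); // asynchronously wait for a free slot
    try
    {
        await ProcessFileAsync(path);
    }
    finally
    {
        throttler.Release(); // free the slot so the next queued operation can start
    }
}).ToList();

await Task.WhenAll(tasks); // completes when every file has been processed
```

This limits concurrency (the second requirement) without limiting threads at all, since the waiting operations consume no thread.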

Be able to keep track of when these threads complete so that I can do some post-processing.

If you don't use the "fire and forget" pattern, then you can keep track of the task and its exceptions just by awaiting it:

var task = MethodAsync(); // start the asynchronous work
await task;               // wait for completion; exceptions surface here
PostProcessing();

async Task MethodAsync() { ... }

Or, for a similar approach, you can use ContinueWith (note that the continuation delegate receives the completed task):

var task = MethodAsync();
await task.ContinueWith(t => PostProcessing());

async Task MethodAsync() { ... }

Read more:

Releasing threads during async tasks

https://learn.microsoft.com/en-us/dotnet/standard/asynchronous-programming-patterns/?redirectedfrom=MSDN

Bizhan

The TPL Dataflow library is probably one of the best options for this job. Here is how you could construct a simple dataflow pipeline consisting of two blocks. The first block accepts a file path and produces some intermediate data that can later be inserted into the database. The second block consumes the data coming from the first block, by sending it to the database.

var inputBlock = new TransformBlock<string, IntermediateData>(filePath =>
{
    return GetIntermediateDataFromFilePath(filePath);
}, new ExecutionDataflowBlockOptions()
{
    MaxDegreeOfParallelism = Environment.ProcessorCount // What the local machine can handle
});

var databaseBlock = new ActionBlock<IntermediateData>(item =>
{
    SaveItemToDatabase(item);
}, new ExecutionDataflowBlockOptions()
{
    MaxDegreeOfParallelism = 20 // What the database server can handle
});

inputBlock.LinkTo(databaseBlock);

Now every time a user uploads a file, you just save the file in a temp path, and post the path to the first block:

inputBlock.Post(filePath);

And that's it. The data will flow from the first to the last block of the pipeline automatically, transformed and processed along the way, according to the configuration of each block.

This is an intentionally simplified example to demonstrate the basic functionality. A production-ready implementation will probably have more options defined, like a CancellationToken and a BoundedCapacity; will check the return value of inputBlock.Post to react in case the block can't accept the job; may propagate completion; will watch the databaseBlock.Completion property for errors; etc.
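Completion propagation, in particular, is a few lines. A sketch reusing the block names from the example above (the shutdown trigger is up to your application):

```csharp
using System.Threading.Tasks.Dataflow;

// Link with PropagateCompletion so that completing the first block
// automatically completes the second one when all items have drained.
inputBlock.LinkTo(databaseBlock, new DataflowLinkOptions
{
    PropagateCompletion = true
});

// When no more files will arrive (e.g., at application shutdown):
inputBlock.Complete();

// Await the last block; this returns when every posted item has been
// saved, and it also surfaces any exceptions thrown inside the pipeline.
await databaseBlock.Completion;
```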

If you are interested at following this route, it would be a good idea to study the library a bit, in order to become familiar with the options available. For example there is a TransformManyBlock available, suitable for producing multiple outputs from a single input. The BatchBlock may also be useful in some cases.

The TPL Dataflow library is built into .NET Core, and is available for .NET Framework as the System.Threading.Tasks.Dataflow NuGet package. It has some learning curve, and some gotchas to be aware of, but it's nothing terrible.

Theodor Zoulias