
I need to read 1M rows from an IDataReader, and write n text files simultaneously. Each of those files will be a different subset of the available columns; all n text files will be 1M lines long when complete.

Current plan is one TransformManyBlock to iterate the IDataReader, linked to a BroadcastBlock, linked to n BufferBlock/ActionBlock pairs.

What I'm trying to avoid is having my ActionBlock delegate perform a using (StreamWriter x...) { x.WriteLine(); } that would open and close every output file a million times over.

My current thought is, in lieu of ActionBlock, to write a custom class that implements ITargetBlock<>. Is there a simpler approach?

EDIT 1: The discussion is of value for my current problem, but the answers so far got hyper-focused on file system behavior. For the benefit of future searchers, the thrust of the question was how to build some kind of setup/teardown outside the ActionBlock delegate. This would apply to any kind of disposable that you would ordinarily wrap in a using-block.

EDIT 2: Per @Panagiotis Kanavos, the executive summary of the solution is to set up the object before defining the block, then tear it down in the block's Completion.ContinueWith.
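To illustrate the pattern (SomeDisposable, Message and Handle are just placeholders here, not my real code):

// SomeDisposable, Message and Handle are placeholders for whatever disposable and message type you use
var resource = new SomeDisposable();              // setup, before the block is defined

var block = new ActionBlock<Message>(msg =>
{
    resource.Handle(msg);                         // the delegate captures the long-lived object
});

// Teardown, runs whether the block completed normally or faulted
block.Completion.ContinueWith(_ => resource.Dispose());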

amonroejj
  • Am I better off skipping TPL Dataflow and using _n_ BlockingCollection/Task.Run pairs? – amonroejj Jul 13 '21 at 17:25
  • What does your code really do? There's nothing wrong with TPL Dataflow. In fact, since each block uses only 1 task by default, you could even use a FileStream created *outside* the block. If you need to write 1M lines though, a *better* solution would be to batch them and write the entire batch in one go instead of writing out line-by-line – Panagiotis Kanavos Jul 14 '21 at 08:52
  • `to n BufferBlock/ActionBlock` why? An ActionBlock already has an input BufferBlock – Panagiotis Kanavos Jul 14 '21 at 08:52
  • `ActionBlock already has an input BufferBlock` yes, but the nature of BroadcastBlock is that delivery isn't guaranteed if the ActionBlock falls behind. I will be running on high-RAM servers where buffer size is not a concern. – amonroejj Jul 14 '21 at 13:12
  • By default the ActionBlock has no capacity limit. You gain nothing by adding another BufferBlock. It will have the same issues as the ActionBlock alone. If you want to guarantee delivery you'll have to write extra code to send the message to the targets – Panagiotis Kanavos Jul 14 '21 at 13:22
  • `I will be running on high-RAM servers` why are you trying to avoid appending a line for every message then? The answer is you care about IO. And multiple IO operations are always slower than a single batch operation – Panagiotis Kanavos Jul 14 '21 at 13:24
  • `multiple IO operations are always slower than a single batch operation` this makes no sense (to me). `WriteLine()` is being called the exact same number of times. – amonroejj Jul 14 '21 at 13:31
  • `By default the ActionBlock has no capacity limit.` You're right. I looked back at an old test program I had written to test (and experienced firsthand) the BroadcastBlock "no guaranteed delivery" behavior a few months ago. I noticed that my ActionBlock had an explicit BoundedCapacity set. I would have no need to set a BoundedCapacity in my current project. – amonroejj Jul 14 '21 at 13:34
  • But does that correspond to actual IO? Again, the file stream is buffered. By writing everything at once you ensure the data actually makes it to the disk. IO occurs only when the buffer is full, and if you care about e.g. 3 or 4 writes of 8KB, you can construct the complete string with a StringBuilder and write it all with `File.AppendAllTextAsync`. In all cases the code is a lot simpler and safer than handling a long-lived stream – Panagiotis Kanavos Jul 14 '21 at 13:36
  • Somewhat relevant: [BroadcastBlock with guaranteed delivery in TPL Dataflow](https://stackoverflow.com/questions/22127660/broadcastblock-with-guaranteed-delivery-in-tpl-dataflow) – Theodor Zoulias Jul 14 '21 at 14:04
  • @amonroejj I added a function that does use a single stream for all messages, but this *does* risk losing unwritten data, and does lock a file for the lifetime of the pipeline. – Panagiotis Kanavos Jul 14 '21 at 14:10
  • `What does your code really do?` I simplified it for the example, but technically, the DataReader is wrapped as an IEnumerable of a POCO class. I'm pivoting tall data to wide, so the loop over the POCOs must be stateful to know when it is time to start a new wide output line. The aspect of one POCO feeding multiple output text files still applies. I only want to iterate the POCOs once, regardless of the number of output text files, because the query behind the DataReader is the real heavy lifting. – amonroejj Jul 14 '21 at 14:50

2 Answers


Often when using TPL Dataflow, I make custom classes so I can have private member variables and private methods for the blocks in my pipeline. Instead of implementing ITargetBlock or ISourceBlock, I just keep whatever blocks I need inside the custom class and expose an ITargetBlock and/or an ISourceBlock as public properties, so that other classes can use those to link things together.
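A rough sketch of that shape, using the Record/StreamWriter scenario from this question (names are placeholders):

using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Sketch only: the concrete block and its private state stay hidden,
// callers link to the exposed ITargetBlock property instead.
public class RecordFileExporter
{
    private readonly StreamWriter _writer;
    private readonly ActionBlock<Record> _block;

    public RecordFileExporter(string path)
    {
        _writer = new StreamWriter(path, append: true);
        _block = new ActionBlock<Record>(record => _writer.WriteLine(record));
        _block.Completion.ContinueWith(_ => _writer.Dispose());
    }

    // Other classes use these to link the pipeline together
    public ITargetBlock<Record> Target => _block;
    public Task Completion => _block.Completion;
}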

TJ Rockefeller

Writing to a file one line at a time is expensive in itself even when you don't have to open the stream each time. Keeping a file stream open has other issues too, as file streams are always buffered, from the FileStream level all the way down to the file system driver, for performance reasons. You'd have to flush the stream periodically to ensure the data was written to disk.

To really improve performance you'd have to batch the records, eg with a BatchBlock. Once you do that, the cost of opening the stream becomes negligible.

The lines should be generated at the last possible moment too, to avoid generating temporary strings that will need to be garbage collected. At n*1M records, the memory and CPU overhead of those allocations and garbage collections would be severe.

Logging libraries batch log entries before writing to avoid this performance hit.

You can try something like this :

var batchBlock = new BatchBlock<Record>(1000);
var writerBlock = new ActionBlock<Record[]>(records => {

    // Create or open the file for appending, write the whole batch, then close it
    using var writer = new StreamWriter(ThePath, true);
    foreach (var record in records)
    {
        writer.WriteLine("{0} = {1} :{2}", record.Prop1, record.Prop5, record.Prop2);
    }
});

batchBlock.LinkTo(writerBlock, options);

or, using asynchronous methods

var batchBlock = new BatchBlock<Record>(1000);
var writerBlock = new ActionBlock<Record[]>(async records => {

    // Create or open the file for appending, write the whole batch, then close it.
    // WriteLineAsync has no format overload, so the line is built with interpolation.
    await using var writer = new StreamWriter(ThePath, true);
    foreach (var record in records)
    {
        await writer.WriteLineAsync($"{record.Prop1} = {record.Prop5} :{record.Prop2}");
    }
});

batchBlock.LinkTo(writerBlock, options);

You can adjust the batch size and the StreamWriter's buffer size for optimum performance.
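For example (the numbers here are just starting points to measure and tune, not recommendations):

var batchBlock = new BatchBlock<Record>(5000);          // bigger batches, fewer file opens
var writerBlock = new ActionBlock<Record[]>(records => {

    // Give the writer a larger buffer than the default so each batch needs fewer flushes.
    // Encoding comes from System.Text.
    using var writer = new StreamWriter(ThePath, true, Encoding.UTF8, bufferSize: 64 * 1024);
    foreach (var record in records)
    {
        writer.WriteLine("{0} = {1} :{2}", record.Prop1, record.Prop5, record.Prop2);
    }
});

batchBlock.LinkTo(writerBlock, options);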

Creating an actual "Block" that writes to a stream

A custom block can be created using the technique shown in the Custom Dataflow block walkthrough. Instead of implementing an actual custom block type, create something that returns whatever LinkTo needs to work with, in this case an ITargetBlock<T>:

ITargetBlock<Record> FileExporter(string path)
{
    var writer = new StreamWriter(path, true);
    var block = new ActionBlock<Record>(async record => {
        await writer.WriteLineAsync($"{record.Prop1} = {record.Prop5} :{record.Prop2}");
    });

    // Close the stream when the block completes, whether it faulted or not
    block.Completion.ContinueWith(_ => writer.Close());
    return block;
}
...


var exporter1 = FileExporter(path1);
previous.LinkTo(exporter1, options);

The "trick" here is that the stream is created outside the block and remains active until the block completes. It's not garbage-collected because it's used by other code. When the block completes, we need to explicitly close it, no matter what happened. block.Completion.ContinueWith(_=>write.Close()); will close the stream whether the block completed gracefully or not.

This is the same technique used in the walkthrough to complete the output BufferBlock:

target.Completion.ContinueWith(delegate
{
   if (queue.Count > 0 && queue.Count < windowSize)
      source.Post(queue.ToArray());
   source.Complete();
});

Streams are buffered by default, so calling WriteLine doesn't mean the data will actually be written to disk. This means we don't know when the data will actually be written to the file. If the application crashes, some data may be lost.
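If you do keep a long-lived stream, flushing at known points at least narrows that window, at the cost of more IO. A variation of the exporter above (FlushingFileExporter is just a name used here for illustration):

ITargetBlock<Record> FlushingFileExporter(string path)
{
    var writer = new StreamWriter(path, true);
    var block = new ActionBlock<Record>(async record => {
        await writer.WriteLineAsync($"{record.Prop1} = {record.Prop5} :{record.Prop2}");
        await writer.FlushAsync();   // push the buffered data down to the OS after every message
    });

    // Still close the stream when the block completes
    block.Completion.ContinueWith(_ => writer.Close());
    return block;
}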

Memory, IO and overheads

When working with 1M rows over a significant period of time, things add up. One could use e.g. File.AppendAllLinesAsync to write batches of lines at once, but that would result in the allocation of 1M temporary strings. At each iteration the runtime would have to use at least as much RAM for those temporary strings as for the batch itself. RAM usage would balloon to hundreds of MBs, then GBs, before the GC fired, freezing the threads.
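As mentioned in the comments, a middle ground is to build each batch's text with a StringBuilder and append it with a single File.AppendAllTextAsync call, so you allocate one large string per batch instead of one string per line. A sketch:

var writerBlock = new ActionBlock<Record[]>(async records => {

    // One string per batch instead of one per line, and a single append call per batch
    var sb = new StringBuilder();
    foreach (var record in records)
    {
        sb.AppendFormat("{0} = {1} :{2}", record.Prop1, record.Prop5, record.Prop2);
        sb.AppendLine();
    }

    await File.AppendAllTextAsync(ThePath, sb.ToString());
});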

With 1M rows and lots of data it's hard to debug and track data in the pipeline. If something goes wrong, things can crash very quickly. Imagine for example 1M messages stuck in one block because one message got blocked.

It's important (for sanity and performance reasons) to keep individual components in the pipeline as simple as possible.

Panagiotis Kanavos
  • `Writing to a file one line at a time is expensive in itself even when you don't have to open the stream each time.` Can anyone provide a cite for this claim (honest question, not snark)? – amonroejj Jul 14 '21 at 13:41
  • @amonroejj That's going to be an "it depends" situation. For starters, being "expensive" is a relative term. It's expensive compared to some things and not compared to others. Next of course writing a file is going to depend a lot on implementation details. If the file's on an SSD it'll behave a lot differently than a HDD or a network drive (and a network drive will vary wildly depending on the connection). – Servy Jul 14 '21 at 13:54
  • @amonroejj unless you use buffering, every stream write would result in an IO operation. `FileStream` is buffered by default for this reason. This results in fewer IO operations but there's always a chance that a crash will result in lost data. Keeping a file locked for a long time can cause other problems too. By batching you ensure the data is written when you expect it to *and* that the file is released – Panagiotis Kanavos Jul 14 '21 at 13:56
  • @amonroejj when you have 1M rows to write, things add up quickly. You also lose the ability to easily debug and track data, especially with complex pipelines. That's why I didn't use `WriteAllLinesAsync` - this would need generating 1M temporary strings. – Panagiotis Kanavos Jul 14 '21 at 13:58