
I have the following scenario.

I am implementing split functionality by reading a huge CSV file line by line. Each line has a categoryId, and based on that id I need to write the line into a separate file.

To do this, I do the following:

  1. Read the huge file line by line.
  2. After reading each line, open a new stream based on the categoryId (only if a stream for that id is not already open). Write the line into the stream and keep the stream open, because there might be more lines for that category later in the huge file.
  3. At the end, after all lines from the huge file are processed, close all open streams. This forces a flush and closes the connections.

My question is: do I need to manually invoke Flush(), say after every 100 lines written, or does StreamWriter handle this itself? I read on the web that there is a buffer that automatically flushes when it is full, but I am not sure if this is true. My concern is that if it doesn't flush and waits for the end of the big file, I might end up with the whole file loaded in memory.

Here is part of the code to show what I am talking about:

try
{
    while (!reader.EndOfStream)
    {
        var line = await reader.ReadLineAsync();
        var locationId = line.Split(',')[0];
        var gdProjectId = GetGDProjectId(locationId);

        if (!openWriters.ContainsKey(gdProjectId))
        {
            // Build a blob name only when we actually create a new writer.
            var blobName = $"{gdProjectId}/{DateTime.UtcNow:dd-MM-yyyy}/{DateTime.UtcNow:HH-mm-ss}-{Guid.NewGuid()}.csv";
            var blockBlobClient = containerClient.GetBlockBlobClient(blobName);
            var newWriteStream = await blockBlobClient.OpenWriteAsync(true);
            openWriters.Add(gdProjectId, new StreamWriter(newWriteStream, Encoding.UTF8));
        }

        var writer = openWriters[gdProjectId];
        await writer.WriteLineAsync(line);

        // SHOULD I MANUALLY INVOKE FLUSH EVERY {X} LINES PROCESSED?
        // TODO: Check whether we need to flush manually or the StreamWriter
        // does it for us when its buffer is full.
        // await writer.FlushAsync();
    }
}
finally
{
    // Always close the writers, whether the operation succeeded or not.
    // Closing flushes any remaining buffered data.
    foreach (var oStream in openWriters)
    {
        oStream.Value.Close();
    }
}
  • Always post code instead of images. Sometimes people want to copy some code and put it in their answers. – apocalypse Apr 15 '21 at 17:40
  • You absolutely do not want to invoke Flush() until you're done writing. Doing so will cause the buffer to be flushed before it's full, defeating the purpose of the buffer. It is good practice to invoke Flush() explicitly when all data has been written, but as mentioned, exiting the using { } block will do it implicitly. I prefer to call it explicitly because if there's an exception writing to the underlying stream, it will be somewhat easier to diagnose. – glenebob Apr 15 '21 at 17:46
  • Hey guys, I updated my example with code instead of a screenshot. @glenebob my question is if and when the automatic flush is invoked by the StreamWriter. I imagine that if the buffer is, let's say, 1024, then after it gets full it will automatically flush and write to the target stream, right? My concern is not to load too much data in memory and consume the whole RAM of the machine. Do you know how many characters I can fit in the writer before it gets flushed automatically? I will be in a situation where I can have a lot of open streams at the same time and I don't want to consume all the RAM. – Dobromir Ivanov Apr 15 '21 at 17:52
  • _"i don't want to consume all the ram"_ -- you don't have any control over that. Even if you call `Flush()`, and even though the `StreamWriter.Flush()` method explicitly flushes the underlying stream, there are more layers to the file I/O than that, such as the OS cache. More to the point, these buffers are only some number of K large; they are way too small to have any material effect on memory overhead and even if they did, **the buffer exists whether you flush or not**. The only reasons to call `Flush()` explicitly is when you have some _specific_ reason to make sure the data has been ... – Peter Duniho Apr 15 '21 at 17:59
  • ... written, such as you're writing to a network stream and don't want the data to be delayed, or you're writing to a log file and want to make sure each line has been written in case the process crashes, things like that. Note also that the comment above from @jdweng is mostly wrong. There's no timer, and you don't need to call `Flush()` when closing the writer, because closing/disposing the writer will _always_ flush the data automatically as part of that operation. – Peter Duniho Apr 15 '21 at 17:59
  • There is no harm in flushing except the time it takes. You need to flush before closing because the close does not flush. I've seen hundreds of times where, after just closing, not all the data made it into the file. – jdweng Apr 15 '21 at 18:10
  • "There is no harm in flushing except the time it takes"... The buffer exists precisely to minimize the time it takes to write all the date to the underlying stream. Calling Flush() unnecessarily defeats the only purpose of the buffer. – glenebob Apr 15 '21 at 19:31
  • @nzhul The writer flushes automatically whenever the buffer fills up. The size of the buffer can be controlled. – glenebob Apr 15 '21 at 19:32

1 Answer


Flush (in the StreamWriter implementation) just sends the data from the buffer to the underlying stream and then calls Flush on the underlying stream, i.e. (pseudocode):

underlyingStream.Write(GetDataFromBuffer());
bufferPosition = 0; // "clears" buffer
underlyingStream.Flush();

Buffer size is constant; it is fixed when the writer is constructed. By default it is 1024 characters (roughly 2 KiB of char data in memory), but a larger value can be passed to the constructor. Flush does not change the buffer size, so calling Flush every 100 lines gains you nothing.
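
For illustration, a minimal sketch of fixing a larger buffer at construction time (a plain FileStream, the file name and the 64K value are just example choices, not from the question):

using var fileStream = File.Create("category-42.csv"); // hypothetical local file
using var writer = new StreamWriter(fileStream, Encoding.UTF8, bufferSize: 64 * 1024);

// The writer now accumulates up to 64K characters and empties the buffer into
// fileStream on its own whenever it fills up; Flush() only empties it earlier.
writer.WriteLine("some,csv,row");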

Q: "Do I need to manually invoke Flush() on lets say -> every 100 lines..."

No. It will not save you any memory. It will just write data to the underlying stream earlier, i.e. it will not wait for the buffer to be full.
Hint: if the AutoFlush property is set to true, Flush is called automatically after each WriteXyz method call.
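
To make the hint concrete, a small sketch of what AutoFlush changes (the file name is an arbitrary example):

var writer = new StreamWriter("log.csv", append: true, Encoding.UTF8)
{
    // With AutoFlush = true every WriteXyz call flushes immediately,
    // which defeats the buffering entirely - so it is usually left false.
    AutoFlush = true
};
writer.WriteLine("a,b,c"); // already pushed to the underlying FileStream
writer.Dispose();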

Q: "My concern is that if it doesn't flush and waits for the end of the big file, I might end up with the whole file loaded in memory."

Buffer size is constant. Calling Flush won't help.

But...

Everything above is true only from the StreamWriter's perspective. Because StreamWriter just holds a reference to some Stream instance, you cannot predict memory usage without knowing the concrete implementation of that Stream instance (Stream is abstract).
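
For example (a sketch; BufferedStream stands in for any underlying stream that keeps its own internal buffer):

using var inner = File.Create("data.csv");
// BufferedStream keeps its own 1 MiB buffer that the StreamWriter knows nothing about.
using var buffered = new BufferedStream(inner, bufferSize: 1024 * 1024);
using var writer = new StreamWriter(buffered, Encoding.UTF8); // plus a small char buffer on top
writer.WriteLine("row");
// StreamWriter.Flush() empties its own buffer and calls buffered.Flush(),
// which in turn pushes the bytes down to inner.
writer.Flush();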

You should post a new question like "Do I need to manually flush XyzStream?" (if such a question has not already been asked).

  • Thank you for the detailed answer! I am using an Azure blob storage stream. I noticed that even after trying to change the buffer size to 4096, for example, when I inspect it the buffer size is still 4194304. I guess their implementation of the stream doesn't allow me to change the buffer. Anyhow, thanks for the advice to change my question. I will first try to google, being specific that I am using a blob storage stream. If I cannot find more information I will open another question here. – Dobromir Ivanov Apr 15 '21 at 18:11
  • I did some research after your comment and found out that the Azure blob stream has different min, max and default values for the buffer size. The default value is 4 MB, the max is 4000 MB and the min is 1 MB. Here is how I can configure it: var newWriteStream = await blockBlobClient.OpenWriteAsync(true, new Azure.Storage.Blobs.Models.BlockBlobOpenWriteOptions { BufferSize = 1048576 }); This answers my question of why I couldn't force the buffer to write to the end stream previously: because I was sending less than 4 MB (the default) :) – Dobromir Ivanov Apr 15 '21 at 18:23
  • @nzhul: Calling flush on network streams usually kills performance. There is no problem with memory usage when you have only a few streams open. But as I see, you are creating new streams based on some IDs, so it can be a problem when you create 100 streams at once (400 MB memory usage, excluding resources used by network connections). – apocalypse Apr 15 '21 at 18:31
  • That's my concern exactly. I might end up with even more than 100 streams open at a time. The business case is something like this: we need to split one huge file into 400-500 smaller files, and each file should contain data based on this categoryId that I mentioned. – Dobromir Ivanov Apr 15 '21 at 18:47
  • @nzhul: then maybe change strategy; it will require making 2 passes over the input file: 1) scan the input file and save a list of (line_start_position, line_project_id) pairs; 2) then filter the list by project_id and start saving lines to the Azure blob, like: inputReader.SetPosition(line_start_position); string csvLine = inputReader.ReadCsvLine(); then azureStream.WriteLine(csvLine); Also, you should use something like the CsvHelper library instead of doing string.Split. – apocalypse Apr 15 '21 at 19:07
  • That's very good advice. I will consider it and experiment with it if I hit a problem with opening a ton of streams. Another thing that came to my mind is that I can keep track of when each stream was last used, and if it has not been used for too long (in time or lines) I can close it temporarily and then open it again if there is a need for it (see the sketch below). – Dobromir Ivanov Apr 15 '21 at 19:58
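
A rough sketch of that last idea, assuming local files for simplicity (every name here is hypothetical, and append mode makes reopening trivial; reopening an Azure block blob mid-write would need more care):

// Requires System.Linq for OrderBy.
const int MaxOpenWriters = 100; // arbitrary cap on simultaneously open streams
var writers = new Dictionary<string, StreamWriter>();
var lastUsed = new Dictionary<string, long>();
long tick = 0;

StreamWriter GetWriter(string projectId)
{
    if (!writers.TryGetValue(projectId, out var writer))
    {
        if (writers.Count >= MaxOpenWriters)
        {
            // Evict the writer that has gone unused the longest.
            var victim = lastUsed.OrderBy(kv => kv.Value).First().Key;
            writers[victim].Close(); // Close flushes any remaining buffered data
            writers.Remove(victim);
            lastUsed.Remove(victim);
        }
        // Append mode lets us reopen the same file later without losing data.
        writer = new StreamWriter($"{projectId}.csv", append: true, Encoding.UTF8);
        writers[projectId] = writer;
    }
    lastUsed[projectId] = tick++;
    return writer;
}

Because Close() flushes the evicted writer, nothing is lost when the same file is reopened in append mode later.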