
My application needs to open a lot of small files: say, 1440 files each containing one minute of data, to read all the data for a given day. Each file is only a couple of kB. This is for a GUI application, so I want the user (== me!) not to have to wait too long.

It turns out that opening the files is rather slow. After investigating, I found that most of the time is spent creating a FileStream (OpenStream = new FileStream) for each file. Example code:

// create the stream and reader
FileStream OpenStream;
BinaryReader bReader;

foreach (string file in files)
{
    // does the file exist? then read and store it
    if (System.IO.File.Exists(file))
    {
        long Start = sw.ElapsedMilliseconds;

        // open the file read-only, otherwise the application can crash
        OpenStream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

        Tijden.Add(sw.ElapsedMilliseconds - Start);

        bReader = new BinaryReader(OpenStream);

        // read everything in one go, works well and fast
        // - track whether appending is still possible; if not, stop appending
        blAppend &= Bestanden.Add(file, bReader.ReadBytes((int)OpenStream.Length), blAppend);

        // close the file
        bReader.Close();
    }
}

Using the stopwatch timer, I see that most (> 80%) of the time is spent creating the FileStream for each file. Creating the BinaryReader and actually reading the file (Bestanden.Add) take almost no time.

I'm baffled by this and cannot find a way to speed it up. What can I do to speed up the creation of the FileStreams?

Update to the question:

  • this happens on both Windows 7 and Windows 10
  • the files are local (on an SSD)
  • there are only those 1440 files in the directory
  • strangely, when reading the (same) files again later, creating the FileStreams suddenly costs almost no time at all; somewhere the OS is remembering the file streams
  • even if I close the application and restart it, opening the files "again" also costs almost no time. This makes it pretty hard to pin down the performance issue; I had to make a lot of copies of the directory to reproduce the problem over and over.
wvl_kszen
  • Seems like a possible O/S issue. What type of O/S are you using? Are the files local or on a network (off the PC that is running the app)? Do the directories contain other files (i.e. Windows has a recommended limit on the number of files per directory)? – Igor Jul 09 '17 at 11:05
  • This is on both Windows 7 AND Windows 10. The files are local, in a directory containing just those 1440 files. I just realized I forgot to mention something: it is only slow the first time I read the files; if I read the files again from the application, creating the FileStreams suddenly costs almost no time (how can this be? Is the OS remembering the file handle? My application certainly is not). If I close the application and start it again, reading the same files AGAIN costs almost no time at all. There must be some kind of buffering/memory in the OS. – wvl_kszen Jul 09 '17 at 11:10
  • Have you tried [File.ReadAllBytes](https://msdn.microsoft.com/en-US/library/system.io.file.readallbytes(v=vs.110).aspx)? – ganchito55 Jul 09 '17 at 11:21
  • I just tested with File.ReadAllBytes and the behaviour is the same (except that you can no longer see exactly where the delay comes from). Reading the files again also costs almost no time. – wvl_kszen Jul 09 '17 at 11:28
  • Windows does cache files in memory, so faster subsequent access is not surprising. You can clear the standby list using https://technet.microsoft.com/en-us/sysinternals/ff700229.aspx – user6144226 Jul 09 '17 at 11:37
  • Aside from making your file reading parallel (see the sketch after these comments) there doesn't seem to be much you can do. – user6144226 Jul 09 '17 at 11:43
  • I understand that Windows can cache files. But it is only creating the FileStream that gets faster the second time, not the actual reading of the file (which happens in Bestanden.Add()). Does creating the FileStream also read the first couple of kB of a file? That would explain what is happening, since the files are only 2-3 kB each. – wvl_kszen Jul 09 '17 at 11:47
  • It might be possible to get some more details about what the O/S is doing using [SysInternals' DiskMon and Process Monitor](https://technet.microsoft.com/en-us/sysinternals/bb545027). Either way, your C# app is just a test bed at this point, as the issue is likely not related to your C# code. – Igor Jul 09 '17 at 11:49
  • I tested using RAMMap (thanks user6144226) and removed the actual reading of the file from my program. Surprise surprise: just creating the FileStream (without reading from it) is enough for the OS to cache the file and put the first 4K of the file in 'standby' (so in memory). This explains what is happening!! – wvl_kszen Jul 09 '17 at 12:07
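A minimal sketch of the parallel approach suggested in the comments above. The directory path and the ConcurrentDictionary are assumptions; the dictionary stands in for the Bestanden collection, which (as the asker notes in the answer comments below) was not thread-safe out of the box:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Threading.Tasks;

    class ParallelReadSketch
    {
        static void Main()
        {
            // Hypothetical directory holding the 1440 one-minute files.
            string[] files = Directory.GetFiles(@"C:\data\20170709");

            // Thread-safe stand-in for the OP's Bestanden collection.
            var bestanden = new ConcurrentDictionary<string, byte[]>();

            Parallel.ForEach(files, file =>
            {
                // Opening the FileStream is the expensive part on a cold cache,
                // so overlapping the opens across threads hides that latency.
                using (var stream = new FileStream(file, FileMode.Open,
                    FileAccess.Read, FileShare.ReadWrite))
                {
                    var buffer = new byte[stream.Length];
                    int read = 0, n;
                    // Read may return fewer bytes than requested, so loop.
                    while (read < buffer.Length &&
                           (n = stream.Read(buffer, read, buffer.Length - read)) > 0)
                        read += n;

                    bestanden[file] = buffer;
                }
            });

            Console.WriteLine($"Read {bestanden.Count} files.");
        }
    }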

2 Answers


As you mentioned in the comments on the question, FileStream reads the first 4K into its buffer when the object is created. You can change the size of that buffer to better match the size of your data (decrease it if your files are smaller than the default buffer, for example). If you read a file sequentially, you can also give the OS a hint about this through FileOptions. In addition, you can drop the BinaryReader, because you read the files in their entirety anyway.

    // create the stream
    FileStream OpenStream;

    foreach (string file in files)
    {
        // does the file exist? then read and store it
        if (System.IO.File.Exists(file))
        {
            long Start = sw.ElapsedMilliseconds;

            // open the file read-only, otherwise the application can crash
            OpenStream = new FileStream(
                file,
                FileMode.Open,
                FileAccess.Read,
                FileShare.ReadWrite,
                bufferSize: 2048, // 2K, for example
                options: FileOptions.SequentialScan);

            Tijden.Add(sw.ElapsedMilliseconds - Start);

            var bufferLength = (int)OpenStream.Length;
            var buffer = new byte[bufferLength];
            OpenStream.Read(buffer, 0, bufferLength);

            // read everything in one go, works well and fast
            // - track whether appending is still possible; if not, stop appending
            blAppend &= Bestanden.Add(file, buffer, blAppend);

            // close the stream (the BinaryReader used to take care of this)
            OpenStream.Dispose();
        }
    }

I do not know the type of the Bestanden object, but if it has methods for reading from an array, you can also reuse the same buffer across files.

    // the buffer should be bigger than the biggest file to read
    var bufferLength = 8192;
    var buffer = new byte[bufferLength];

    foreach (string file in files)
    {
        // skip
        ...
        var fileLength = (int)OpenStream.Length;
        OpenStream.Read(buffer, 0, fileLength);

        blAppend &= Bestanden.Add(file, /* read bytes from buffer */, blAppend);
    }
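For completeness, a minimal runnable version of that reuse pattern. The Bestanden.Add overload taking an ArraySegment<byte> is a hypothetical stand-in, since the real signature is not shown in the question:

    // Shared buffer, sized to hold the biggest file to be read.
    var buffer = new byte[8192];

    foreach (string file in files)
    {
        if (!File.Exists(file))
            continue;

        using (var stream = new FileStream(file, FileMode.Open,
            FileAccess.Read, FileShare.ReadWrite,
            bufferSize: 2048, options: FileOptions.SequentialScan))
        {
            var fileLength = (int)stream.Length;
            int read = 0, n;
            // Read may return fewer bytes than requested, so loop.
            while (read < fileLength &&
                   (n = stream.Read(buffer, read, fileLength - read)) > 0)
                read += n;

            // Hypothetical overload taking a slice of the shared buffer,
            // which avoids allocating a fresh byte[] per file.
            blAppend &= Bestanden.Add(file,
                new ArraySegment<byte>(buffer, 0, fileLength), blAppend);
        }
    }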

I hope it helps.

Ivan R.
  • I hadn't noticed yet that I can set the buffer size. This (the standard buffer size of 4K) explains all the streaming examples that read 4K pieces all the time. I parallelized the opening of all the files, which cost me some time since my Bestanden class turned out not to be thread-safe. I have the feeling that for small files it's better to read them in parallel, and for bigger files one by one; I might have to experiment with this. Reading the files in parallel sped everything up by a factor of about 3x. – wvl_kszen Jul 17 '17 at 16:38
  • [Through constructor parameter - bufferSize](https://msdn.microsoft.com/en-us/library/d0y914c5(v=vs.110).aspx) – Ivan R. Jul 17 '17 at 16:44

Disclaimer: this answer is just (founded) speculation that this is a Windows bug rather than something you can fix with different code.

So this behaviour might be related to the Windows bug described here: "24-core CPU and I can’t move my mouse".

> These processes were all releasing the lock from within NtGdiCloseProcess.

So if FileStream uses and holds a similarly critical lock in the OS, it would wait a few µs for every file, which would add up over thousands of files. It may be a different lock, but the bug mentioned above at least suggests the possibility of a similar problem.

To prove or disprove this hypothesis, some deep knowledge of the inner workings of the kernel would be necessary.

zx485