8

For my WPF project, I have to calculate the total size of all files in a single directory (which may contain subdirectories).

Sample 1

DirectoryInfo di = new DirectoryInfo(path);
// Sum the Length of every file in the tree; the FileInfo objects come from the enumeration itself.
var totalLength = di.EnumerateFiles("*.*", SearchOption.AllDirectories).Sum(fi => fi.Length);

// size is the threshold in (decimal) megabytes
if (totalLength / 1000000 >= size)
    return true;

Sample 2

var sizeOfHtmlDirectory = Directory.GetFiles(path, "*.*", SearchOption.AllDirectories);
long totalLength = 0;
foreach (var file in sizeOfHtmlDirectory)
{
    // GetFiles returns plain path strings, so a FileInfo is created per file to read its Length.
    totalLength += new FileInfo(file).Length;
    if (totalLength / 1000000 >= size)
        return true;
}

Both samples work.

Sample 1 completes massively faster. I haven't timed this accurately, but on my PC, using the same folder with the same content and file sizes, Sample 1 takes a few seconds while Sample 2 takes a few minutes.

EDIT

I should point out that the bottleneck in Sample 2 is within the foreach loop! The GetFiles call itself returns quickly and the loop is entered quickly.

My question is: how do I find out why this is the case?
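
One way to answer that empirically is to time both approaches directly. Below is a minimal sketch (not code from the original post) assuming `path` points at the directory in question:

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class SizeTimings
{
    static void Main(string[] args)
    {
        string path = args[0]; // directory to measure

        // Sample 1: the FileInfo instances come straight from the enumeration.
        Time("DirectoryInfo.EnumerateFiles", () =>
            new DirectoryInfo(path)
                .EnumerateFiles("*.*", SearchOption.AllDirectories)
                .Sum(fi => fi.Length));

        // Sample 2: only path strings are returned; a fresh FileInfo is created per file.
        Time("Directory.GetFiles + new FileInfo", () =>
            Directory.GetFiles(path, "*.*", SearchOption.AllDirectories)
                     .Sum(f => new FileInfo(f).Length));
    }

    static void Time(string label, Func<long> work)
    {
        var sw = Stopwatch.StartNew();
        long total = work();
        sw.Stop();
        Console.WriteLine("{0}: {1:N0} bytes in {2} ms", label, total, sw.ElapsedMilliseconds);
    }
}

Run the two measurements in both orders (or after a reboot) so the file system cache does not favour whichever runs second.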

MyDaftQuestions
  • It could be because with `GetFiles` you first have to enumerate all the files before returning a single result. Try adding a `ToArray()` before the `.Sum` – xanatos Apr 22 '15 at 14:09
  • And you could even try `Directory.EnumerateFiles`/`DirectoryInfo.GetFiles` – xanatos Apr 22 '15 at 14:11
  • Have you also compared it with the approach where you use a `DirectoryInfo` as root and `dirInfo.GetFiles` to get all `FileInfo` objects? (A sketch of this variant follows these comments.) – Tim Schmelter Apr 22 '15 at 14:13
  • Disk access order is the problem here. With EnumerateFiles() you read the Length property at the same time the file name was generated. The disk reader head is still located at the directory entry and Length is readily available. With GetFiles() you *first* generate all the names and *then* need to send the disk back to find the file again and obtain the Length property. The extra disk seeks and reads this generates when the file info does not fit in the file system cache are expensive. – Hans Passant Apr 22 '15 at 14:29
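
For reference, a rough sketch of the variant Tim Schmelter suggests (not from the question itself; it assumes the same `path` and `size` as the samples above): keep the eager GetFiles call, but make it on a DirectoryInfo so the FileInfo objects, Length included, come back from the enumeration instead of being created one by one.

DirectoryInfo di = new DirectoryInfo(path);
FileInfo[] files = di.GetFiles("*.*", SearchOption.AllDirectories);

long totalLength = 0;
foreach (FileInfo fi in files)
{
    // Length was filled in during enumeration; no separate lookup per file.
    totalLength += fi.Length;
    if (totalLength / 1000000 >= size)
        return true;
}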

2 Answers

9

Contrary to what the other answers indicate, the main difference is not EnumerateFiles vs GetFiles; it's DirectoryInfo vs Directory. In the latter case you only have strings and have to create new FileInfo instances separately, which is very costly.

DirectoryInfo returns FileInfo instances that use cached information, whereas creating new FileInfo instances directly does not; more details here and here.

Relevant quote (via "The Old New Thing"):

In NTFS, file system metadata is a property not of the directory entry but rather of the file, with some of the metadata replicated into the directory entry as a tweak to improve directory enumeration performance. Functions like Find­First­File report the directory entry, and by putting the metadata that FAT users were accustomed to getting "for free", they could avoid being slower than FAT for directory listings. The directory-enumeration functions report the last-updated metadata, which may not correspond to the actual metadata if the directory entry is stale.
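
To make that concrete, here is a small sketch of the two code paths being contrasted (mine, not the answer's; `path` is assumed to be the directory from the question):

// Fast: the enumeration itself yields FileInfo instances whose Length was
// populated from the directory entry that was just read.
long fast = new DirectoryInfo(path)
    .EnumerateFiles("*.*", SearchOption.AllDirectories)
    .Sum(fi => fi.Length);

// Slow: the enumeration yields only strings; each new FileInfo(file) must go
// back to the file system for that file's metadata before Length can be read.
long slow = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
    .Sum(file => new FileInfo(file).Length);

Both compute the same total; only the second pays for an extra metadata lookup per file.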

BrokenGlass
  • While I had [the same thought](http://stackoverflow.com/questions/29800121/why-is-enumeratefiles-much-quicker-than-calculating-the-sizes/29800250#comment47728570_29800121), how do you know that `EnumerateFiles` does not need to do the same under the hood? The `FileInfo` instances must also be initialized. Maybe there is some IO overhead if you use `FileInfo(file).Length` because the file must be searched first. – Tim Schmelter Apr 22 '15 at 14:19
  • It could be that... FileInfoResultHandler.CreateObject initializes the FileInfo that is returned directly from a Win32Native.WIN32_FIND_DATA – xanatos Apr 22 '15 at 14:22
  • @TimSchmelter ultimately, they all end up using the same [FileSystemEnumerator](http://referencesource.microsoft.com/#mscorlib/system/io/filesystemenumerable.cs,e9aaa9fc3bf05462) class. Therefore, the bottleneck is most likely the extra call to `FileInfo` during the iteration. – James Apr 22 '15 at 14:23
  • It's the caching layer used by the OS - see the quote and the full article by Raymond Chen - it's NOT just the extra call @James – BrokenGlass Apr 22 '15 at 14:24
-2

EnumerateFiles streams its results lazily (deferred execution), whereas GetFiles waits until all files have been enumerated before returning the collection of files. This will have a big effect on your result.

James Lucas