4

Using System.IO.Directory.GetFiles(), I would like to find images with the .png extension located on a NAS server.

string searchingString = "ZLLK9";
// original
var fileList1= Directory.GetFiles(directoryPath).Select(p => new FileInfo(p)).Where(q => q.Name.Substring(0, q.Name.LastIndexOf('.')).Split('_').First() == searchingString);
// fixed    
var fileList2 = Directory.GetFiles(directoryPath, string.Format("{0}_*.png", searchingString));

There are two ways to find the files whose names contain "ZLLK9".

The first ('original') way, using LINQ, is too slow at finding the files. Performance improved with the 'fixed' way, but I don't know what makes it different.

I would like help understanding the difference between the two ways in detail.

Simon MᶜKenzie
FragrantJH
  • 3
    Why do you think there is a significant difference between the two approaches? – Tim May 12 '15 at 02:07
  • if you are trying to find out if the files contain `.png` Extension then why not do something easier `var fi = new DirectoryInfo(directoryPath).GetFiles().Where(f => (f.FullName.EndsWith(".png"))).ToArray();` – MethodMan May 12 '15 at 02:09

3 Answers

8

The first way is slow for 2 reasons:

  • You're constructing a FileInfo object for each file. Constructing a FileInfo is relatively light, but all those instantiations add up when you're querying a lot of files, and they're unnecessary: all you really need is the file's name.

  • The LINQ approach retrieves everything, then filters afterwards. It's much more efficient (and faster) to get the file system to do the filtering for you.

If you still want to use LINQ, here's a more performant version of your query, which cuts out a lot of enumeration and string manipulation:

// Requires: using System.IO; using System.Linq; using System.Text.RegularExpressions;
var fileList1 = Directory.GetFiles(directoryPath).Where(
    path => Regex.IsMatch(Path.GetFileName(path), @"^ZLLK9_.*\.png$"));
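
As a minimal sketch of the first point (working on path strings only, with no FileInfo), the original query's logic can be written with Path.GetFileNameWithoutExtension. The class and method names below are hypothetical, and the code assumes the intent is "the part of the name before the first underscore equals the search string". Unlike the original Substring call, it does not throw for a file name that has no '.' in it. It still enumerates every file, so it does not address the second point.

```csharp
using System;
using System.IO;
using System.Linq;

static class FileNameFilter
{
    // Hypothetical helper: true when the part of the file name before the
    // first '_' (with the last extension removed) equals the prefix.
    public static bool HasPrefixToken(string path, string prefix)
    {
        string name = Path.GetFileNameWithoutExtension(path);
        int underscore = name.IndexOf('_');
        string firstToken = underscore < 0 ? name : name.Substring(0, underscore);
        return string.Equals(firstToken, prefix, StringComparison.Ordinal);
    }

    // Filters in the calling code: all paths are still enumerated,
    // but no FileInfo objects are created along the way.
    public static string[] Find(string directoryPath, string prefix)
    {
        return Directory.EnumerateFiles(directoryPath)
            .Where(path => HasPrefixToken(path, prefix))
            .ToArray();
    }
}
```

Note that this check is not identical to the "ZLLK9_*.png" search pattern: like the original LINQ query, it accepts any extension and also accepts a name such as ZLLK9.png that has no underscore.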
Simon MᶜKenzie
  • Yep, asker can verify this by mousing over each `var` and seeing what the actual return is. – AaronLS May 12 '15 at 02:12
  • 3
    For performance reasons, it's generally much better to provide the filename pattern to Directory.GetFiles() rather than filter after-the-fact like this answer does. – reuben May 12 '15 at 02:13
  • I'm with you, @reuben! I was just explaining the differences between the 2 approaches. – Simon MᶜKenzie May 12 '15 at 02:14
  • 1
    @SimonMᶜKenzie But there are *two* differences. One difference is the unnecessary construction of `FileInfo` objects; the other is filtering on the client vs. filtering on the server. Both are useful to understand. – reuben May 12 '15 at 02:15
  • @reuben. True. I will update my answer to include your point. – Simon MᶜKenzie May 12 '15 at 02:16
  • Are you sure about it loading the metadata for each file? I thought it was lazy until you actually accessed a property that needed to be read from disk. – Enigmativity May 12 '15 at 02:30
  • 1
    @SimonMᶜKenzie Thanks a lot ;-) I think I don't need to use LINQ for searching only file names. – FragrantJH May 12 '15 at 02:30
  • @Enigmativity, you're right. It doesn't query all the metadata - all it does is demand read permission for the file, so there's still a filesystem hit, but not as big as I thought it was! – Simon MᶜKenzie May 12 '15 at 03:01
  • @SimonMᶜKenzie - I'm not even sure that there is a file system hit. I think the read permission it is looking for is the .NET run-time security permission. – Enigmativity May 12 '15 at 03:14
  • @Enigmativity, once again, you're right. I did some tests with [procmon](https://technet.microsoft.com/library/bb896645.aspx), and although creating a `FileInfo` does hit the filesystem for some files (*.doc and *.docx for me - maybe it's an indexing thing), it doesn't do anything for the majority of files. I'll update my answer again! – Simon MᶜKenzie May 12 '15 at 05:44
4

The 1st one gets all the files in that directory and afterwards runs the query to filter them by name.

The 2nd one returns only the files whose names match, using the Windows internal API, which is much faster than filtering in C# with LINQ.

The performance difference comes down to this: the 2nd one uses the internal API, which is faster than C# code.

kwangsa
2

The answer lies in the way you use GetFiles().

Your original solution gets all files from a directory. Your code then iterates through them to find the ones that match the pattern. Documentation here: Directory.GetFiles Method (String).

Your fixed version uses a different .NET Framework method, Directory.GetFiles Method (String, String). The second parameter is a search pattern. Here the filtering is done not by your own code (LINQ) but by the underlying operating system itself.
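
To see the difference between the two overloads, here is a small sketch that counts how many paths each call hands back. The class name, the method name and the sample file names are made up for illustration, and it runs against a throwaway temp directory rather than a NAS, so it shows what is returned, not how long it takes.

```csharp
using System;
using System.IO;

static class GetFilesComparison
{
    // Creates a temporary directory with four sample files, then returns
    // (paths returned without a pattern, paths returned with the pattern).
    public static Tuple<int, int> CountReturnedPaths(string searchingString)
    {
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        try
        {
            foreach (string name in new[] { "ZLLK9_a.png", "ZLLK9_b.png", "OTHER_a.png", "ZLLK9_c.txt" })
                File.WriteAllText(Path.Combine(dir, name), "");

            // One-argument overload: every path comes back; filtering is left to the caller.
            int unfiltered = Directory.GetFiles(dir).Length;

            // Two-argument overload: only paths matching the search pattern come back.
            int filtered = Directory.GetFiles(dir, string.Format("{0}_*.png", searchingString)).Length;

            return Tuple.Create(unfiltered, filtered);
        }
        finally
        {
            // Clean up the temporary directory and its files.
            Directory.Delete(dir, true);
        }
    }

    static void Main()
    {
        var counts = CountReturnedPaths("ZLLK9");
        Console.WriteLine("Without pattern: {0}, with pattern: {1}", counts.Item1, counts.Item2);
    }
}
```

With these four sample files the first call returns 4 paths and the second returns 2. On a large directory, the first number is the amount of data your own code would have to receive and filter.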

Quality Catalyst
  • 1
    Regarding your second point, I think it'd be more accurate to say that filtering the files is done by the operating system itself, not by the framework. – Simon MᶜKenzie May 12 '15 at 05:51