Good morning all. I have a folder that contains thousands of subdirectories at different depths. I need to list all of the directories that don't contain subdirectories (the proverbial "end of the line"). It's fine if they contain files. Is there a way to do this with EnumerateDirectories?

For example, if a fully recursive EnumerateDirectories returned:

/files/
/files/q
/files/q/1
/files/q/2
/files/q/2/examples
/files/7
/files/7/eb
/files/7/eb/s
/files/7/eb/s/t

I'm only interested in:

/files/q/1
/files/q/2/examples
/files/7/eb/s/t
BuZz
Sam2S

2 Answers

This should work:

var folderWithoutSubfolder = Directory.EnumerateDirectories(root, "*.*", SearchOption.AllDirectories)
     .Where(f => !Directory.EnumerateDirectories(f, "*.*", SearchOption.TopDirectoryOnly).Any());
Tim Schmelter

If you want to avoid calling EnumerateDirectories() twice for each directory, you can implement it like so:

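public IEnumerable<string> EnumerateLeafFolders(string root)
{
    // Becomes true as soon as root turns out to have at least one subfolder.
    bool anySubfolders = false;

    foreach (var subfolder in Directory.EnumerateDirectories(root))
    {
        anySubfolders = true;

        // Recurse: any leaf below subfolder is also a leaf below root.
        foreach (var leafFolder in EnumerateLeafFolders(subfolder))
            yield return leafFolder;
    }

    // No subfolders at all means root is itself a leaf.
    if (!anySubfolders)
        yield return root;
}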

In my timing tests, this approach ran more than twice as fast as the LINQ approach.

I ran the test as a release build, outside any debugger, on an SSD containing a large number of folders; the total number of leaf folders was 25035.

My results for the second run of the program (the first run was only there to warm the disk cache):

Calling Using linq.  1 times took 00:00:08.2707813
Calling Using yield. 1 times took 00:00:03.6457477
Calling Using linq.  1 times took 00:00:08.0668787
Calling Using yield. 1 times took 00:00:03.5960438
Calling Using linq.  1 times took 00:00:08.1501002
Calling Using yield. 1 times took 00:00:03.6589386
Calling Using linq.  1 times took 00:00:08.1325582
Calling Using yield. 1 times took 00:00:03.6563730
Calling Using linq.  1 times took 00:00:07.9994754
Calling Using yield. 1 times took 00:00:03.5616040
Calling Using linq.  1 times took 00:00:08.0803573
Calling Using yield. 1 times took 00:00:03.5892681
Calling Using linq.  1 times took 00:00:08.1216921
Calling Using yield. 1 times took 00:00:03.6571429
Calling Using linq.  1 times took 00:00:08.1437973
Calling Using yield. 1 times took 00:00:03.6606362
Calling Using linq.  1 times took 00:00:08.0058955
Calling Using yield. 1 times took 00:00:03.6477621
Calling Using linq.  1 times took 00:00:08.1084669
Calling Using yield. 1 times took 00:00:03.5875057

As you can see, the yield approach is significantly faster: roughly 3.6 seconds per run against roughly 8.1 seconds for the LINQ approach. (Probably because it doesn't enumerate each folder twice.)

My test code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

namespace Demo
{
    class Program
    {
        private void run()
        {
            string root = "F:\\TFROOT";

            // Count() forces each lazy sequence to be enumerated in full.
            Action test1 = () => leafFolders1(root).Count();
            Action test2 = () => leafFolders2(root).Count();

            for (int i = 0; i < 10; ++i)
            {
                test1.TimeThis("Using linq.");
                test2.TimeThis("Using yield.");
            }
        }

        static void Main()
        {
            new Program().run();
        }

        static IEnumerable<string> leafFolders1(string root)
        {
            var folderWithoutSubfolder = Directory.EnumerateDirectories(root, "*.*", SearchOption.AllDirectories)
                 .Where(f => !Directory.EnumerateDirectories(f, "*.*", SearchOption.TopDirectoryOnly).Any());

            return folderWithoutSubfolder;
        }

        static IEnumerable<string> leafFolders2(string root)
        {
            bool anySubfolders = false;

            foreach (var subfolder in Directory.EnumerateDirectories(root))
            {
                anySubfolders = true;

                foreach (var leafFolder in leafFolders2(subfolder))
                    yield return leafFolder;
            }

            if (!anySubfolders)
                yield return root;
        }
    }

    static class DemoUtil
    {
        public static void Print(this object self)
        {
            Console.WriteLine(self);
        }

        public static void Print(this string self)
        {
            Console.WriteLine(self);
        }

        public static void Print<T>(this IEnumerable<T> self)
        {
            foreach (var item in self)
                Console.WriteLine(item);
        }

        // Runs the action count times and prints the total elapsed time.
        public static void TimeThis(this Action action, string title, int count = 1)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            Console.WriteLine("Calling {0} {1} times took {2}", title, count, sw.Elapsed);
        }
    }
}
Matthew Watson
  • EnumerateDirectories is lazily evaluated, so the extra call made in Tim's answer is pretty cheap. When I benchmarked your code against Tim's, Tim's ran in slightly less than half the time. I imagine this is because using an iterator recursively adds a lot of overhead. – Brian Jul 24 '13 at 13:36
  • @Brian Did you run the test multiple times to eliminate artifacts caused by disk caching the first run through? – Matthew Watson Jul 24 '13 at 13:40
  • Yes. I ran the tests a couple hundred times before I started, compiled with optimizations, and ran each test once (to avoid jitter artifacts) before starting the timers. – Brian Jul 24 '13 at 13:47
  • @Brian I'm seeing my method working more than twice as fast. I'll attach my test code so you can see it. Did you make sure to run it OUTSIDE the debugger (otherwise it'll run in debug mode even if it's compiled as release)? – Matthew Watson Jul 24 '13 at 13:54
  • I tried altering your code to not use recursion (see http://pastebin.com/eZdbM0i6 ). When I benchmarked it, it ran about 10% faster than Tim's. These benefits mostly did not vanish if I added an extra call to Directory.EnumerateDirectories, so I think they relate to LINQ overhead. Anyhow, it's possible that extraneous calls to EnumerateDirectories are more expensive on your system. – Brian Jul 24 '13 at 13:55
  • @Brian Did you try the exact code I posted above (but changing the root path to something appropriate of course)? My system is not unusual (Windows 8.0 x64), and I see similar results on other systems. – Matthew Watson Jul 24 '13 at 14:05
  • I must have done something wrong, because yours is faster. And my change (I posted the wrong URL; see http://pastebin.com/E7k2xgcW) provided negligible benefit. – Brian Jul 24 '13 at 14:09
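
The comments above mention a non-recursive variant of the yield approach. The pastebin code is not reproduced here, so the following is only a sketch of what such a variant might look like, not Brian's actual code. The class name LeafFolderSketch and the method name EnumerateLeafFoldersIterative are made up for illustration. It swaps the recursive iterator for an explicit stack, so results may come back in a different order than from the recursive version.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical names; this is an illustrative sketch, not code from the thread.
static class LeafFolderSketch
{
    // Walks the tree with an explicit stack instead of a recursive iterator.
    public static IEnumerable<string> EnumerateLeafFoldersIterative(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string current = pending.Pop();
            bool anySubfolders = false;

            // Each folder is enumerated exactly once; its children are queued for later.
            foreach (string subfolder in Directory.EnumerateDirectories(current))
            {
                anySubfolders = true;
                pending.Push(subfolder);
            }

            // A folder with no subfolders is a leaf (this includes root itself).
            if (!anySubfolders)
                yield return current;
        }
    }
}
```

It is called the same way as the recursive version, for example foreach (var leaf in LeafFolderSketch.EnumerateLeafFoldersIterative(root)). Whether it is actually faster would need to be measured on your own system, as the exchange above shows.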