I'm trying to write code that lists the files present in one folder structure but not in another. In other words, I have to archive files only if they are not already archived. The two root folders are network folders potentially used by around 1000 users.

In my first attempt, I used VBA from an Excel workbook. I used a shell command to get the list and then the "Dir" function to check for their presence in the other folder.

UPDATE - Here is my code:

Dim arr
Dim i As Integer
Dim sFolder As String
Dim sYear As String
Dim sDecade As String
Dim sGFolder As String
Dim sDestXLF As String

sFolder = "T:\FirstRootFolder\"

arr = Filter(Split(CreateObject("wscript.shell").exec("cmd /c Dir ""T:\FirstRootFolder\xx*"" /b /ad /on").StdOut.ReadAll, vbCrLf), ".")

For i = 1 To UBound(arr)
  If Dir(sFolder & arr(i) & "\*.xlf") <> "" Then
    sYear = Right(arr(i), 2)
    sDecade = Mid(arr(i), 3, 2)
    sGFolder = "G:\SecondFolderRoot\" & sYear & "\xx\xx" & sDecade & "\"
    sDestXLF = sGFolder & arr(i) & ".it.xlf"
    If Dir(sDestXLF) = vbNullString Then
      ListBox1.AddItem arr(i)
    End If
  End If
Next i

It works fine and it takes around 6 seconds to complete.

Now I'm learning the .NET Framework and C#, so I tried to do the same without opening Excel. I tried different approaches using framework methods (GetFiles, File.Exists, etc.) rather than shell commands, and I ended up with this:

string FirstFolderRoot, SecondFolderRoot;
lboLista.DataSource = null;

FirstFolderRoot = @"T:\FirstRootFolder\";
SecondFolderRoot = @"G:\SecondFolderRoot\";

var foundFilesFirst = Directory.EnumerateFiles(FirstFolderRoot, "*.xlf", SearchOption.AllDirectories)
                     .Where(s => s.Contains("xx"))
                     .Select(m => Path.GetFileNameWithoutExtension(Path.GetFileNameWithoutExtension(m))).ToArray();
var foundFilesSecond = Directory.EnumerateFiles(SecondFolderRoot, "*.xlf", SearchOption.AllDirectories)
                     .Select(m => Path.GetFileNameWithoutExtension(Path.GetFileNameWithoutExtension(m))).ToArray();

var foundFiles = foundFilesFirst.Except(foundFilesSecond);
lboLista.DataSource = foundFiles.ToList();

It works fine too, but it takes around a minute and a half to complete, with most of the time spent filling the two collections.

Is there a way to get performance comparable to VBA? Are shell commands really that much faster than the framework, or am I not using it the right way?

I read that the fastest way would be to use the Windows API, but I would really prefer to use the framework.

Marco
  • Possible duplicate here: http://stackoverflow.com/questions/7596747/c-sharp-how-to-list-the-files-in-a-sub-directory-fast-optimised-way – Ahmad Mar 23 '15 at 11:54
  • Maybe look at `Directory.GetFiles(strPath).Where(f => f.EndsWith(".xlf"))` to build the collection. Why do you need `EnumerateFiles`? –  Mar 23 '15 at 12:26
  • @Ahmad I have already read that post. It suggests using GetFiles but, as I read somewhere else, EnumerateFiles is supposed to be faster because elements are processed before the collection is completely filled. – Marco Mar 23 '15 at 12:44
  • @Michal Krzych (see comment above) EnumerateFiles is supposed to be faster than GetFiles because elements are processed before the collection is completely filled. – Marco Mar 23 '15 at 12:46
  • @Marco you don't gain any benefit from `EnumerateFiles` because you are using `ToArray()` which will enumerate them anyway. The advantage of `EnumerateFiles()` is to allow you to begin using the files (eg in a for-loop) while they're still being populated in the background. – Chris L Mar 23 '15 at 13:24
  • It's quite strange that VBA takes 6 seconds but .NET takes 90 seconds... Something is definitely not right –  Mar 23 '15 at 13:26
  • @ChrisL Thanks Chris for clarifying that point. Any ideas on how to write the code so that it benefits from `EnumerateFiles()`? My `Select` is there to extract the file name from the full path. – Marco Mar 23 '15 at 14:05
  • @MichalKrzych Actually, the VBA is just calling a shell command. I can try to do the same in C#, but I'd like to use the framework's native capabilities. – Marco Mar 23 '15 at 14:06
  • @Marco Can you include the exact code you're running for each scenario? It's likely you're doing something differently in VBA than in your C#. The time difference is too great to be explained by the coding methods alone. Other possibilities: network usage at the time of the tests, bug(s) in either method resulting in incorrect output, etc. – Chris L Mar 23 '15 at 14:25
  • You are comparing apples and oranges: your VBA code filters `xx*` and your .NET code filters `*.xlf`. Apparently there are a lot fewer files whose names start with "xx", which is not surprising. – Hans Passant Mar 23 '15 at 15:15
  • @ChrisL How can I include my full code? Here in comments? Sorry but I'm new to this community. – Marco Mar 23 '15 at 15:53
  • You can edit your question and include more code. – Chris L Mar 23 '15 at 15:55
  • @HansPassant I removed all string manipulations from the code I posted. As far as I know, I/O is hundreds of times slower than string manipulation. I can assure you both versions return the same set of file names. Anyway, you are right, my question is "Are shell commands (apples) really that much faster than the framework (oranges), or am I not using it the right way?" – Marco Mar 23 '15 at 15:57
  • @ChrisL Full code added. Let me know if you need to know the full folder structure. – Marco Mar 23 '15 at 16:42
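
As the comments above point out, the two versions do different work: the VBA lists only the top-level `xx*` folders and then probes for one specific file per folder, while the C# recursively enumerates every `.xlf` file under both roots. A minimal sketch of the VBA logic using only framework calls could look like the following. The names `ArchiveCheck` and `FindUnarchived` are hypothetical, the folder layout is taken from the VBA code in the question, and whether this matches the 6-second timing has not been measured.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ArchiveCheck
{
    // Mirrors the VBA loop: one top-level directory listing, then at most
    // two probes per folder instead of a recursive scan of both roots.
    public static List<string> FindUnarchived(string firstRoot, string secondRoot)
    {
        var missing = new List<string>();
        foreach (string dir in Directory.EnumerateDirectories(firstRoot, "xx*"))
        {
            string name = Path.GetFileName(dir);

            // Same filter as the VBA Filter(..., "."): keep names containing a dot.
            // The length guard protects the Substring calls below.
            if (name.Length < 4 || !name.Contains(".")) continue;

            // Skip folders that hold no .xlf file directly inside them.
            if (!Directory.EnumerateFiles(dir, "*.xlf").Any()) continue;

            string year = name.Substring(name.Length - 2); // VBA: Right(name, 2)
            string decade = name.Substring(2, 2);          // VBA: Mid(name, 3, 2)
            string dest = Path.Combine(secondRoot, year, "xx", "xx" + decade, name + ".it.xlf");

            if (!File.Exists(dest)) missing.Add(name);
        }
        return missing;
    }

    static void Main(string[] args)
    {
        // Usage: pass the two root folders as command-line arguments.
        foreach (string name in FindUnarchived(args[0], args[1]))
            Console.WriteLine(name);
    }
}
```

Because `EnumerateDirectories` and `EnumerateFiles(...).Any()` are consumed lazily, each folder is processed as soon as it is returned, which is the situation where `EnumerateFiles()` has an advantage over `GetFiles()`.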

1 Answer

Take a look at the MSDN example (msdn link) of directory and file comparison in C#. It uses a custom IEqualityComparer<T> implementation together with LINQ operations such as SequenceEqual() and Except().

The bottleneck will be the time taken to run the two GetFiles() calls. If the folders are on different network drives, the two calls can be run in separate threads for increased performance.
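
Running the two scans in separate threads could be sketched as follows. This is only an illustration under assumptions: `ParallelScan`, `ScanNames` and `MissingFromSecond` are hypothetical names, `Task.Run` requires .NET 4.5 or later, and any gain depends on the two roots really being served by different drives or servers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelScan
{
    // Hypothetical helper: collects the file names (without directory) under a root.
    static HashSet<string> ScanNames(string root, string pattern)
    {
        return new HashSet<string>(
            Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories)
                     .Select(f => Path.GetFileName(f)),
            StringComparer.OrdinalIgnoreCase);
    }

    // Scans both roots concurrently and returns the names found only under the first.
    public static List<string> MissingFromSecond(string firstRoot, string secondRoot, string pattern)
    {
        Task<HashSet<string>> first = Task.Run(() => ScanNames(firstRoot, pattern));
        Task<HashSet<string>> second = Task.Run(() => ScanNames(secondRoot, pattern));
        Task.WaitAll(first, second);

        // Set difference: remove every name that also exists under the second root.
        first.Result.ExceptWith(second.Result);
        return first.Result.OrderBy(n => n, StringComparer.OrdinalIgnoreCase).ToList();
    }

    static void Main(string[] args)
    {
        // Usage: pass the two root folders as command-line arguments.
        foreach (string name in MissingFromSecond(args[0], args[1], "*.xlf"))
            Console.WriteLine(name);
    }
}
```

The sketch compares file names only, case-insensitively, rather than name and length as in the excerpt below; a `HashSet<string>` is used so the difference is computed in a single pass.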

For clarity here's a slightly modified excerpt:

using System;
using System.Collections.Generic;
using System.Linq;

// This implementation defines a very simple comparison 
// between two FileInfo objects. It only compares the name 
// of the files being compared and their length in bytes. 
class FileCompare : System.Collections.Generic.IEqualityComparer<System.IO.FileInfo>
{
    public FileCompare() { }

    public bool Equals(System.IO.FileInfo f1, System.IO.FileInfo f2)
    {
        return (f1.Name == f2.Name &&
                f1.Length == f2.Length);
    }

    // Return a hash that reflects the comparison criteria. According to the  
    // rules for IEqualityComparer<T>, if Equals is true, then the hash codes must 
    // also be equal. Because equality as defined here is a simple value equality, not 
    // reference identity, it is possible that two or more objects will produce the same 
    // hash code. 
    public int GetHashCode(System.IO.FileInfo fi)
    {
        string s = String.Format("{0}{1}", fi.Name, fi.Length);
        return s.GetHashCode();
    }
}


class CompareDirs
{

    static void Main(string[] args)
    {
        // Create two identical or different temporary folders  
        // on a local drive and change these file paths. 
        string pathA = @"C:\TestDir";
        string pathB = @"C:\TestDir2";

        System.IO.DirectoryInfo dir1 = new System.IO.DirectoryInfo(pathA);
        System.IO.DirectoryInfo dir2 = new System.IO.DirectoryInfo(pathB);

        // Take a snapshot of the file system.
        IEnumerable<System.IO.FileInfo> list1 = dir1.GetFiles("*.*", System.IO.SearchOption.AllDirectories);
        IEnumerable<System.IO.FileInfo> list2 = dir2.GetFiles("*.*", System.IO.SearchOption.AllDirectories);

        // Use the custom file comparer defined above.
        // Except returns the files in list1 that are not in list2.
        var uniqueFiles = list1.Except(list2, new FileCompare());
    }

}
Chris L
  • Do you think that those two `GetFiles()` calls are faster than my code? I'll give it a try and let you know. – Marco Mar 23 '15 at 22:00
  • With your solution it takes 45 seconds to complete, a nice 50% reduction. Anyway, it's still 7x slower than the shell command. I'm starting to suspect that GetFiles/EnumerateFiles issues many more I/O requests than the dir command. – Marco Mar 24 '15 at 09:59
  • I'm not fully up to scratch with the internals of the Windows OS, but it's possible that there's some behind-the-scenes caching or something at work when using `dir`? I'm still pretty certain it won't take 40 seconds longer under exactly the same conditions. – Chris L Mar 24 '15 at 14:36