
I am currently working on a personal project that walks through directory trees and searches for new or changed files. I want to save all files my search has found, with path and MD5 checksum, into a CSV file for the comparison afterwards. The files are loaded into an IEnumerable variable as objects of my own class iFile, but writing the CSV file takes about 5 minutes for just 15,000 files (1 min and 6 sec for processing the IEnumerable into a List). Is there a way to speed up my code?

This is my search function:

public static IEnumerable<iFile> GetAllFiles(string root, bool ignoreUnauthorizedAccess = true)
    {
        Stack<string> stack = new Stack<string>();

        stack.Push(root);
        while (stack.Count > 0)
        {
            string curDir = stack.Pop();
            string[] files = null;
            try
            {
                files = Directory.GetFiles(curDir);
            }
            catch (UnauthorizedAccessException)
            {
                if (!ignoreUnauthorizedAccess) throw;
            }

            catch (IOException)
            {
                if (!ignoreUnauthorizedAccess) throw;
            }

            if (files != null)
                foreach (string file in files)
                {
                    iFile f = new iFile(new FileInfo(file));
                    yield return f;
                }


            string[] dirs = null;
            try
            {
                dirs = Directory.GetDirectories(curDir);
            }
            catch (UnauthorizedAccessException)
            {
                if (!ignoreUnauthorizedAccess) throw;
            }

            catch (IOException)
            {
                if (!ignoreUnauthorizedAccess) throw;
            }

            if (dirs != null)
                foreach (string dir in dirs)
                    stack.Push(dir);
        }
    }

This is my writing function:

private static void writeToSystem(this IEnumerable<iFile> files, string path = @"c:\")
    {

        using (System.IO.StreamWriter f = new System.IO.StreamWriter(path))
            {
                foreach (var i in files)
                {
                    f.WriteLine(i.getPath() + ";" + i.getHash());
                }
            }

    }
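
For comparison, here is a minimal sketch of the batched-write idea that also comes up in the comments further down: build all CSV lines first and hand them to the framework in one call. The name writeToSystemBatched is only illustrative, it needs `using System.Linq;`, and whether it helps depends on where the time is actually spent.

private static void writeToSystemBatched(this IEnumerable<iFile> files, string path)
    {
        // Materialize every CSV line up front...
        var lines = files.Select(i => i.getPath() + ";" + i.getHash());

        // ...and write them all with one framework call instead of line by line.
        System.IO.File.WriteAllLines(path, lines);
    }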

And the getHash function from the iFile class:

public string getHash()
    {
        using (var md5 = new MD5CryptoServiceProvider())
        {
            if (File.Exists(filename) && fInfo.Length < 100000)
            {
                try
                {
                    byte[] data = md5.ComputeHash(Encoding.Default.GetBytes(filename), 0, 2000);
                    return BitConverter.ToString(data);
                }
                catch (Exception)
                {
                    Program.logger.log("Error while creating the MD5 hash!", Program.logger.LOG_ERROR);
                    return "";
                }
            }
            else
            {
                return "";
            }
        }
    }
Tobi
  • `Is there a way to speed up my code` yes, just change your disk to a more modern SSD and use multithreading for your files with Tasks – BRAHIM Kamel May 09 '17 at 13:32
  • Note that `GetAllFiles` only seems to be fast because of the `yield return`. It's not actually doing the entire search before returning the `IEnumerable`. The code to search the filesystem gets run incrementally as you enumerate through things. – adv12 May 09 '17 at 13:37
  • You may try to build one string and then write it once with, say, [File.WriteAllText](https://msdn.microsoft.com/en-us/library/ms143375(v=vs.110).aspx). Use `StringBuilder` in that case. – Tigran May 09 '17 at 13:38
  • I wrote the wrong function, updated my post! @BRAHIMKamel: I have a decent SSD in my PC, so that's not the point. I already thought about multithreading, but splitting the list into parts will take longer. – Tobi May 09 '17 at 13:39
  • @adv12 So what are you suggesting I do now? I am a little bit lost. I'm trying to optimize my iFile class, where the MD5 sum is generated. – Tobi May 09 '17 at 13:42
  • @Tobi, I'm guessing that the bulk of your time is spent doing the I/O of walking the filesystem in `GetAllFiles`. I'm also guessing that there's not much you can do to speed it up. You can optimize all the non-I/O work, but I would guess that the CPU time is dwarfed by the I/O time. – adv12 May 09 '17 at 13:44
  • @adv12 OK thanks, but how do other programs handle this kind of issue? Some of them go through directories with 100,000+ files and subdirectories in seconds. – Tobi May 09 '17 at 13:47
  • First of all, you should analyze the time taken by `GetAllFiles(...).ToList()` because only then you know how long it takes to just enumerate all files and convert them to `iFile`. The `GetAllFiles(...)` call itself will only give you the time to enumerate the current directory without any sub-directories. When we know the *actual* enumeration time, we can talk about enumeration vs processing time and about optimizations. – grek40 May 09 '17 at 13:50
  • @Tobi, I have no idea. Who knows? Maybe you're right and you can speed things up by optimizing your md5 generation. If you want to know how long the search is taking as things are now, use a stopwatch to time `files.ToList()`. `ToList` will enumerate over everything. Then you can try optimizing your calculations and see if it has any significant effect. – adv12 May 09 '17 at 13:51
  • @grek40, jinx, you owe me a Coke. – adv12 May 09 '17 at 13:52
  • @adv12 there you go *hands over a virtual coke and some supposedly yummy cookies (if they ever really existed)* – grek40 May 09 '17 at 13:55
  • @grek40 Process time for `files.ToList()` with 15,000 files was 1 min and 6 sec – Tobi May 09 '17 at 13:57
  • @Tobi better edit that information into the question. Now at least you know that you need roughly 1/5 of your time just touching each of the files once. You could try if you get improved performance when you create a parallel task for processing (enumerate + calculate path and hash) the files in each sub directory, but I don't know whether there is actually much improvement potential. – grek40 May 09 '17 at 14:01
  • @grek40 But while the IEnumerable is converted into a list, the hash and path are processed. So there would be no opportunity to do parallel tasks. – Tobi May 09 '17 at 14:06
  • @Tobi No, you could still do so. The real problem is that, unless you have multiple hard drives, you're not going to get much out of querying them in parallel. You might get a *bit*, but likely not much. – Servy May 09 '17 at 14:14
  • Depending on what exactly happens there, the definition of `iFile.getPath` and `iFile.getHash` might be important. In case you compute any non-trivial thing from the file contents, nonblocking IO can improve the processing throughput. – grek40 May 09 '17 at 14:44
  • @grek40 I added my getHash function and found out that it's the "speed killer" – Tobi May 16 '17 at 15:07
  • You compute the MD5 value of the byte-encoded filename? Maybe I just misinterpreted the code, but it looks like it and I can't really wrap my head around the idea behind this move. – grek40 May 16 '17 at 15:17
  • @grek40 No, I am creating a hash out of the first 2000 bytes of a file. – Tobi May 22 '17 at 12:46
  • @Tobi __are you sure?__ To me it looks like you call the following overload: https://msdn.microsoft.com/en-us/library/ds4kkd55(v=vs.110).aspx and it is not expecting a filename. – grek40 May 22 '17 at 12:57
  • @grek40 **BUT** even if it is like you mentioned, it shouldn't take that long generating hash values out of filenames. So there is a pretty big performance leak somewhere, isn't there? – Tobi May 23 '17 at 12:46
  • @Tobi definitely, the fact that you don't actually create the hash from the file contents makes the performance impact even more interesting. I just don't feel like going into detail about optimizing something that is not even intended to exist. So first you should worry about getting the correct MD5 from the file content (a sketch follows these comments) and look at the new performance, then worry about optimization. – grek40 May 23 '17 at 12:48
  • @grek40 I opened a new question thread for this topic with a different code snippet which took the identical time. Go here [CLICK](https://stackoverflow.com/questions/44135432/performance-issues-while-creating-file-checksums) – Tobi May 23 '17 at 12:54
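
As the comments above conclude, the ComputeHash call in getHash hashes the byte-encoded filename rather than the file contents (and since the encoded name is almost always shorter than 2000 bytes, ComputeHash throws and the catch silently returns an empty string). A minimal sketch of hashing the first 2000 bytes of the file itself, assuming fInfo is the FileInfo stored in the iFile class as in the question:

// Sketch only: hash (up to) the first 2000 bytes of the file contents
// instead of the byte-encoded filename. fInfo is assumed to be the
// FileInfo of the file, as in the original iFile class.
using (var md5 = new MD5CryptoServiceProvider())
using (var stream = fInfo.OpenRead())
{
    byte[] buffer = new byte[2000];
    int read = stream.Read(buffer, 0, buffer.Length);   // may read fewer bytes for small files
    byte[] data = md5.ComputeHash(buffer, 0, read);
    return BitConverter.ToString(data);
}

The File.Exists / fInfo.Length guard and the try/catch from the original method can be kept around this block unchanged.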

2 Answers


I think your getPath() and getHash() are the time-consuming ones:

i.getPath() + ";" + i.getHash()

nithinmohantk

In order to parallelize your workload, you have to restructure your code.

The following approach combines a sequential directory traversal with a parallel task that processes the files within each directory. Different directories will be inspected in parallel, but all files within one directory are still processed sequentially inside their task. This may be suitable for a structure of sub-directories where each directory does not contain too many files. If one directory contains a large number of files, or if there are many directories with only a few files each, a different parallelization might be necessary (a sketch of one such variant follows the code below).

public static async Task<IEnumerable<string>> ProcessAllFiles(string root, Func<iFile, string> fileToLineConverter, bool ignoreUnauthorizedAccess = true)
{
    Stack<string> stack = new Stack<string>();
    List<Task<IEnumerable<string>>> resultTasks = new List<Task<IEnumerable<string>>>();


    stack.Push(root);
    while (stack.Count > 0)
    {
        string curDir = stack.Pop();

        resultTasks.Add(Task.Run(() => ProcessFilesInDirectory(curDir, fileToLineConverter, ignoreUnauthorizedAccess)));

        string[] dirs = null;
        try
        {
            dirs = Directory.GetDirectories(curDir);
        }
        catch (UnauthorizedAccessException)
        {
            if (!ignoreUnauthorizedAccess) throw;
        }
        catch (IOException)
        {
            if (!ignoreUnauthorizedAccess) throw;
        }

        if (dirs != null)
            foreach (string dir in dirs)
                stack.Push(dir);
    }

    var results = await Task.WhenAll(resultTasks);
    return results.SelectMany(x => x);
}

private static IEnumerable<string> ProcessFilesInDirectory(string curDir, Func<iFile, string> fileToLineConverter, bool ignoreUnauthorizedAccess)
{
    FileInfo[] files = null;
    try
    {
        var dir = new DirectoryInfo(curDir);
        files = dir.GetFiles();
    }
    catch (UnauthorizedAccessException)
    {
        if (!ignoreUnauthorizedAccess) throw;
    }

    if (files != null)
        return files.Select(x => fileToLineConverter(new iFile(x))).ToList();

    return Enumerable.Empty<string>();
}


async Task ExecuteFull(string path)
{
    var lines = await ProcessAllFiles(
        @"C:\",
        x => x.getPath() + ";" + x.getHash(),
        false);


    using (System.IO.StreamWriter f = new System.IO.StreamWriter(path))
    {
        foreach (var i in lines)
        {
            f.WriteLine(i);
        }
    }
}
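
For the case mentioned at the top of this answer (one directory holding most of the files), a flatter variant would be to parallelize over the file enumeration itself, for example with PLINQ. A minimal sketch reusing GetAllFiles and iFile from the question; outputPath and the degree of parallelism are only illustrative, and as the comments below note, the gain is limited when the work is purely I/O-bound:

// Sketch: flat parallelization over the files themselves with PLINQ,
// instead of one task per directory. Needs "using System.Linq;".
// outputPath is just an illustrative name for the target CSV file.
var lines = GetAllFiles(@"C:\")
    .AsParallel()
    .WithDegreeOfParallelism(4)               // illustrative value; tune or omit
    .Select(x => x.getPath() + ";" + x.getHash())
    .ToList();

System.IO.File.WriteAllLines(outputPath, lines);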
grek40
  • The operation is pretty much entirely IO bound. There isn't going to be much point in parallelizing operations that are just going to be serialized by the hard drive. – Servy May 09 '17 at 14:49
  • @Servy might be true, depending on the `getHash` function mostly... if it does any significant work on the file content, then some parallel or nonblocking work will improve performance. Always remember IO is not just slow, but actually capable of transferring large amounts into memory at once - so the crucial part is not querying every tiny IO piece at the time when it is needed but instead to prefetch some work. – grek40 May 09 '17 at 14:52
  • As the question mentions, it's computing an MD5 checksum on the file contents. Computing the checksum will be very fast, but it'll need to read the entire file's contents to compute it, which will be slow. – Servy May 09 '17 at 14:54
  • Thanks for this piece of code, but it's not improving the speed of the process. I found out that the getHash function takes most of the time; without it, the code took only 14 sec for everything with ~70,000 files! – Tobi May 16 '17 at 15:16