
I have the following code, which I have used up until now to compare a list of file entries to itself by hash codes:

    for (int i = 0; i < fileLists.SourceFileListBefore.Count; i++) // Compare SourceFileList files to themselves
    {
        for (int n = i + 1; n < fileLists.SourceFileListBefore.Count; n++) // Don't need to do the same comparison twice!
        {
            if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)
                if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)
                {
                    // do Something
                }
        }
    }

where SourceFileListBefore is a List.

I want to change this code so that it can execute in parallel on multiple cores. I thought about doing this with PLINQ, but I'm completely new to LINQ.

I tried

    var duplicate = from entry in fileLists.SourceFileListBefore.AsParallel()
                    where fileLists.SourceFileListBefore.Any(x => (x.hash == entry.hash) && (x.targetNode.IsFile) && (entry.targetNode.IsFile))
                    select entry;

but it won't work like this, because I have to execute code for each pair of entries whose hash codes match. So I would at least have to get a collection of results containing both x and entry from LINQ, not just one entry. Is that possible with PLINQ?

Ich
  • `I used up until now to compare a list of file entries to itself by hash codes` What is your real intention? Seems to me an [XY problem](http://www.perlmonks.org/?node=xy+problem). – EZI Mar 14 '15 at 23:05
  • Why are you doing this in parallel? Are you computing the hashes on the fly? If not then your list would have to be huge for it to make a difference. – Enigmativity Mar 14 '15 at 23:09
  • It might be a list of 100k files so it can take a while. I have to compute the list immediately for my program. – Ich Mar 14 '15 at 23:16

1 Answer


Why don't you look at optimising your code first?

Looking at this statement:

    if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)

This means you can straight away build a single list of files where IsFile == true (making the loops smaller already).
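
A minimal sketch of that first step, reusing the property names from the question (the variable name `fileEntries` is just illustrative):

    // Filter once up front so only real files take part in the comparison.
    var fileEntries = fileLists.SourceFileListBefore
        .Where(e => e.targetNode.IsFile)
        .ToList();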

Secondly,

    if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)

Why don't you build a lookup of the hashes first?

Then iterate over your filtered list, looking each entry up in the lookup you created. If a hash appears more than once, there is a match (the current node's hash plus some other node's hash), so you only do work on entries whose hash actually matches another node's.
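
As a rough sketch against the types from the question (property names taken from there), building and consuming such a lookup could look like this:

    // Group the file entries by hash; any group with more than one
    // entry is a set of files sharing the same hash.
    var lookup = fileLists.SourceFileListBefore
        .Where(e => e.targetNode.IsFile)
        .ToLookup(e => e.hash);

    foreach (var duplicateFiles in lookup.Where(g => g.Count() > 1))
    {
        // do Something with the matching entries in duplicateFiles
    }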

I wrote a blog post about this, which you can read at CodePERF[dot]NET - .NET Nested Loops vs Hash Lookups.

PLINQ will only slightly improve a bad solution to your problem.

Added some comparisons:

Total File Count: 16900
TargetNode.IsFile == true: 11900
Files with Duplicate Hashes = 10000 (5000 unique hashes)
Files with triplicate Hashes = 900 (300 unique hashes)
Files with Unique hashes = 1000

And the actual setup method:

    [SetUp]
    public void TestSetup()
    {
        _sw = new Stopwatch();
        _files = new List<File>();
        int duplicateHashes = 10000;
        int triplicateHashesCount = 900;
        int randomCount = 1000;
        int nonFileCount = 5000;

        for (int i = 0; i < duplicateHashes; i++) // 10000 files sharing 5000 hashes (each hash used twice)
        {
            var hash = i % (duplicateHashes / 2);
            _files.Add(new File {Id = i, Hash = hash.ToString(), TargetNode = new Node {IsFile = true}});
        }
        for (int i = 0; i < triplicateHashesCount; i++) // 900 files sharing 300 hashes (each hash used three times)
        {
            var hash = int.MaxValue - 100000 - i % (triplicateHashesCount / 3);
            _files.Add(new File {Id = i, Hash = hash.ToString(), TargetNode = new Node {IsFile = true}});
        }

        for (int i = 0; i < randomCount; i++) // 1000 files with unique hashes
        {
            var hash = int.MaxValue - i;
            _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
        }

        for (int i = 0; i < nonFileCount; i++) // 5000 entries where IsFile == false
        {
            var hash = i % (nonFileCount / 2);
            _files.Add(new File {Id = i, Hash = hash.ToString(), TargetNode = new Node {IsFile = false}});
        }
        _matched = 0;
    }

Then your current method:

    [Test]
    public void FindDuplicates()
    {
        _sw.Start();

        for (int i = 0; i < _files.Count; i++) // Compare SourceFileList-Files to themselves
        {
            for (int n = i + 1; n < _files.Count; n++) // Don't need to do the same comparison twice!
            {
                if (_files[i].TargetNode.IsFile && _files[n].TargetNode.IsFile)
                    if (_files[i].Hash == _files[n].Hash)
                    {
                        // Do Work
                        _matched++;
                    }
            }
        }

        _sw.Stop();
    }

Takes around 7.1 seconds on my machine.

Using a lookup to find hashes which appear multiple times takes 21ms.

    [Test]
    public void FindDuplicatesHash()
    {
        _sw.Start();

        var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);

        foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
        {
            // Do Work for each unique hash, which appears multiple times in _files.

            // If you need to do work on each pair, you will need to create pairs from duplicateFiles
            // this can be an exercise for you ;-)
            _matched++;
        }

        _sw.Stop();
    }
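
If you do need to run code for every pair, one possible sketch (not part of the original answer, just one way to expand each duplicate group into its pairs):

    foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
    {
        var group = duplicateFiles.ToList();
        for (int i = 0; i < group.Count; i++)
        {
            for (int n = i + 1; n < group.Count; n++)
            {
                // Do Work on the pair (group[i], group[n]);
                // only entries sharing the same hash ever reach this point.
            }
        }
    }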

In my test, using PLINQ for counting the lookups is actually slower (as there is a large cost in dividing the list between threads and aggregating the results back):

    [Test]
    public void FindDuplicatesHashParallel()
    {
        _sw.Start();

        var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);

        _matched = lookup.AsParallel().Where(g => g.Count() > 1).Sum(g => 1);

        _sw.Stop();
    }

This took 120ms, so almost 6 times as long as the plain lookup with my current source list.
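
If the work you do per matching group is expensive (rather than just counting), parallelising across the duplicate groups may still pay off. A hedged sketch of that shape, unmeasured:

    lookup.Where(g => g.Count() > 1)
          .AsParallel()
          .ForAll(duplicateFiles =>
          {
              // Expensive per-group work runs on thread-pool threads here;
              // avoid touching shared state such as _matched without synchronisation.
          });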

Michal Ciechan