Why not look at optimising your code first?
Looking at this statement:
if (fileLists.SourceFileListBefore[i].targetNode.IsFile && fileLists.SourceFileListBefore[n].targetNode.IsFile)
means you can straight away build a single list of files where IsFile == true, making the loop smaller already.
Secondly, looking at this statement:
if (fileLists.SourceFileListBefore[i].hash == fileLists.SourceFileListBefore[n].hash)
why not build a lookup of the hashes first?
Then iterate over your filtered list, checking each file's hash against that lookup. If a hash maps to more than one file, there is a match (the current node's hash plus at least one other node's hash), so you only do work on the files whose hashes actually collide.
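The idea in miniature, using plain strings as stand-in hashes (the names here are hypothetical, not from the original code):

```csharp
using System;
using System.Linq;

class LookupSketch
{
    static void Main()
    {
        // Stand-in hash values; in the real code these would be file.Hash strings.
        var hashes = new[] { "a", "b", "a", "c", "b", "a" };

        // Build the lookup once: a single O(n) pass instead of the O(n^2) nested loop.
        var lookup = hashes.ToLookup(h => h);

        // Any group with more than one entry is a set of duplicates.
        foreach (var group in lookup.Where(g => g.Count() > 1))
            Console.WriteLine($"{group.Key}: {group.Count()} occurrences");
        // a: 3 occurrences
        // b: 2 occurrences
    }
}
```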
I wrote a blog post about this, which you can read at CodePERF[dot]NET - .NET Nested Loops vs Hash Lookups.
PLINQ will only slightly improve a bad solution to your problem.
Added some comparisons:
Total file count: 16900
TargetNode.IsFile == true: 11900
Files with duplicate hashes: 10000 (5000 unique hashes)
Files with triplicate hashes: 900 (300 unique hashes)
Files with unique hashes: 1000
And the actual setup method:
[SetUp]
public void TestSetup()
{
    _sw = new Stopwatch();
    _files = new List<File>();
    int duplicateHashes = 10000;
    int triplicateHashesCount = 900;
    int randomCount = 1000;
    int nonFileCount = 5000;
    for (int i = 0; i < duplicateHashes; i++)
    {
        var hash = i % (duplicateHashes / 2);
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
    }
    for (int i = 0; i < triplicateHashesCount; i++)
    {
        var hash = int.MaxValue - 100000 - i % (triplicateHashesCount / 3);
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
    }
    for (int i = 0; i < randomCount; i++)
    {
        var hash = int.MaxValue - i;
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = true } });
    }
    for (int i = 0; i < nonFileCount; i++)
    {
        var hash = i % (nonFileCount / 2);
        _files.Add(new File { Id = i, Hash = hash.ToString(), TargetNode = new Node { IsFile = false } });
    }
    _matched = 0;
}
Then your current method:
[Test]
public void FindDuplicates()
{
    _sw.Start();
    for (int i = 0; i < _files.Count; i++) // Compare SourceFileList files to themselves
    {
        for (int n = i + 1; n < _files.Count; n++) // Don't need to do the same comparison twice!
        {
            if (_files[i].TargetNode.IsFile && _files[n].TargetNode.IsFile)
                if (_files[i].Hash == _files[n].Hash)
                {
                    // Do Work
                    _matched++;
                }
        }
    }
    _sw.Stop();
}
Takes around 7.1 seconds on my machine.
Using a lookup to find hashes which appear multiple times takes 21ms.
[Test]
public void FindDuplicatesHash()
{
    _sw.Start();
    var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);
    foreach (var duplicateFiles in lookup.Where(files => files.Count() > 1))
    {
        // Do Work for each unique hash which appears multiple times in _files.
        // If you need to do work on each pair, you will need to create pairs from duplicateFiles.
        // This can be an exercise for you ;-)
        _matched++;
    }
    _sw.Stop();
}
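If the per-pair work from the original loop is needed, the pairs can be generated inside each duplicate group, which keeps the quadratic cost confined to small groups. A minimal sketch, using strings as stand-ins for the File objects in a group:

```csharp
using System;
using System.Linq;

class PairSketch
{
    static void Main()
    {
        // Stand-in for one duplicateFiles group: three files sharing a hash.
        var files = new[] { "f1", "f2", "f3" }.ToList();

        // Enumerate each unordered pair within the group exactly once,
        // mirroring the (i, n = i + 1) bounds of the original nested loop.
        for (int i = 0; i < files.Count; i++)
            for (int n = i + 1; n < files.Count; n++)
                Console.WriteLine($"{files[i]} <-> {files[n]}");
        // f1 <-> f2
        // f1 <-> f3
        // f2 <-> f3
    }
}
```

A group of k files yields k * (k - 1) / 2 pairs, so with the test data above this still produces the same 5900 matched pairs as the nested loop, without comparing unrelated hashes.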
In my test, using PLINQ to count the lookups is actually slower, as there is a large cost in dividing the list between threads and aggregating the results back.
[Test]
public void FindDuplicatesHashParallel()
{
    _sw.Start();
    var lookup = _files.Where(f => f.TargetNode.IsFile).ToLookup(f => f.Hash);
    _matched = lookup.AsParallel().Where(g => g.Count() > 1).Sum(g => 1);
    _sw.Stop();
}
This took 120ms, so almost 6 times as long with my current source list.