
I'm using Directory.EnumerateFiles to list files in two separate directories. Some of the files exist in both folders. How can I remove any duplicate files from the combined list?

try
{
    corporateFiles = Directory.EnumerateFiles(@"\\" + corporateServer, "*.pdf", SearchOption.AllDirectories).ToList();
}
catch
{
    corporateFiles = new List<string>();
}

try
{
    functionalFiles = Directory.EnumerateFiles(@"\\" + functionalServer, "*.pdf", SearchOption.AllDirectories).ToList();
}
catch
{
    functionalFiles = new List<string>();
}

var combinedFiles = corporateFiles.Concat(functionalFiles);
Antarr Byrd
  • Possible duplicate of [how to merge 2 List with removing duplicate values in C#](http://stackoverflow.com/questions/4031262/how-to-merge-2-listt-with-removing-duplicate-values-in-c-sharp) – Dudemanword Jun 14 '16 at 18:30
  • Hash every file (SHA1 will do) and store the result in a `HashSet`. When you come across a file with a hash already existing in the set, delete it. – James Buck Jun 14 '16 at 18:30
  • The trick is that you only want to consider the file name and not the entire path when removing duplicates. And then you need to decide which path you want to keep. – juharr Jun 14 '16 at 18:31
  • Which path do you want to keep if there are duplicate file names? – Sonny Childs Jun 14 '16 at 18:36
  • @SonnyChilds If there are duplicates removing the one in functionalfiles should be ok. – Antarr Byrd Jun 14 '16 at 18:42
  • Be aware: just because they have the same name does not mean they have the same contents. – paparazzo Jun 14 '16 at 19:08
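
The hashing idea from the comments can be sketched as follows. This is a minimal illustration, not code from the thread: the class name `ContentDeduplicator` and the method `DistinctByContent` are hypothetical, and the sketch only filters the list (it does not delete anything from disk). It treats two files as duplicates when their SHA-1 hashes match, regardless of their names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class ContentDeduplicator
{
    // Returns the paths whose contents were not seen earlier in the sequence.
    // The first path encountered for a given content hash is the one kept.
    public static List<string> DistinctByContent(IEnumerable<string> paths)
    {
        var seenHashes = new HashSet<string>();
        var unique = new List<string>();

        using (var sha1 = SHA1.Create())
        {
            foreach (var path in paths)
            {
                string hash;
                using (var stream = File.OpenRead(path))
                {
                    hash = BitConverter.ToString(sha1.ComputeHash(stream));
                }

                // HashSet<T>.Add returns false when the value is already present.
                if (seenHashes.Add(hash))
                {
                    unique.Add(path);
                }
            }
        }

        return unique;
    }
}
```

Passing `corporateFiles.Concat(functionalFiles)` would keep the corporate copy of any file that exists in both places, at the cost of reading every file once.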

2 Answers

It seems I cannot satisfy my lust for LINQ.

Here's a one-liner:

var combinedFiles = corporateFiles.Concat(functionalFiles.Where(x => !corporateFiles.Select(y => Path.GetFileName(y)).Contains(Path.GetFileName(x))));

This keeps the file paths from corporateFiles whenever a file name occurs in both lists. Swap the two lists if you would rather keep the paths from functionalFiles.

EDIT: Here's the same logic broken into steps, which is hopefully more readable:

// Get the file names that occur in both lists. The result is stored in a
// HashSet so the query is evaluated once rather than for every file below:
var duplicateFileNames = new HashSet<string>(corporateFiles.Select(y => Path.GetFileName(y)).Intersect(functionalFiles.Select(y => Path.GetFileName(y))));

// Remove entries in 'functionalFiles' that are duplicates:
var functionalFilesWithoutDuplicates = functionalFiles.Where(x => !duplicateFileNames.Contains(Path.GetFileName(x)));

// Combine the un-touched 'corporateFiles' with the filtered 'functionalFiles':
var combinedFiles = corporateFiles.Concat(functionalFilesWithoutDuplicates);
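
The same filtering could also be done in a single pass over each list by collecting the file names of the preferred list in a set. The sketch below is only an illustration: `FileListMerger` and `MergeByFileName` are hypothetical names, and comparing names case-insensitively is an assumption (reasonable for Windows shares, but not stated in the question).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class FileListMerger
{
    // Keeps every path in 'preferred' and appends only those paths from
    // 'other' whose file name does not occur anywhere in 'preferred'.
    public static List<string> MergeByFileName(
        IEnumerable<string> preferred, IEnumerable<string> other)
    {
        var preferredList = preferred.ToList();

        // Assumption: file names are compared without regard to case.
        var preferredNames = new HashSet<string>(
            preferredList.Select(p => Path.GetFileName(p)),
            StringComparer.OrdinalIgnoreCase);

        return preferredList
            .Concat(other.Where(p => !preferredNames.Contains(Path.GetFileName(p))))
            .ToList();
    }
}
```

With the variables from the question this would be called as `FileListMerger.MergeByFileName(corporateFiles, functionalFiles)`.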
Sonny Childs

Use Union instead of Concat:

var combinedFiles = corporateFiles.Union(functionalFiles);

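On its own, Union compares the full path strings, so it only removes entries that are identical in both lists. To treat files with the same name as duplicates, use the overload that takes an IEqualityComparer<string> and compare only the file-name part: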

var combined = corporateFiles.Union(functionalFiles, new FileNameComparer());

class FileNameComparer : EqualityComparer<string>
{
    public override bool Equals(string x, string y)
    {
        var name1 = Path.GetFileName(x);
        var name2 = Path.GetFileName(y);
        return name1 == name2;
    }

    public override int GetHashCode(string obj)
    {
        var name = Path.GetFileName(obj);
        return name.GetHashCode();
    }
}
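
Union yields the elements of the first sequence before those of the second and keeps the first occurrence of each key, so the corporate path is the one that survives. Note that it also collapses repeated names within a single list. As a hedged variation, the sketch below ignores case when comparing names; `CaseInsensitiveFileNameComparer` and `UnionExample.Combine` are hypothetical names, and case-insensitive matching is an assumption about the file shares involved.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical variant of a file-name comparer that ignores case.
sealed class CaseInsensitiveFileNameComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        StringComparer.OrdinalIgnoreCase.Equals(Path.GetFileName(x), Path.GetFileName(y));

    // Equal names must produce equal hash codes, so hash the same key
    // that Equals compares (the file name, ignoring case).
    public int GetHashCode(string obj) =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(Path.GetFileName(obj) ?? string.Empty);
}

static class UnionExample
{
    // Paths from 'corporateFiles' win when a file name occurs in both lists.
    public static List<string> Combine(
        IEnumerable<string> corporateFiles, IEnumerable<string> functionalFiles) =>
        corporateFiles.Union(functionalFiles, new CaseInsensitiveFileNameComparer()).ToList();
}
```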
Arturo Menchaca
  • That wouldn't work since the file names will include the entire path including the different directory names. – juharr Jun 14 '16 at 18:30
  • I don't see how this solves the problem, or why it is marked as the answer. With `Union` all we have is a list containing file names from both directories. It isn't useful since we only want the names duplicated in both directories. We need `Intersect` here... – Bozhidar Stoyneff Jun 14 '16 at 19:50
  • @BozhidarStoinev: He wants the files from both lists, but files with the same file name should appear only once. – Arturo Menchaca Jun 14 '16 at 19:52
  • @ArturoMenchaca: Aw... the title made me think that he wants them physically deleted, so I thought he just needs the duplicates... Isn't that the case? – Bozhidar Stoyneff Jun 14 '16 at 19:55
  • @BozhidarStoinev: In that case then yes, he will need Intersect. But when he says `How can I remove any duplicate files from the combined list?` I believe he means the duplicates in the list. :) – Arturo Menchaca Jun 14 '16 at 20:04
  • @ArturoMenchaca: Yeah... it's confusing. Anyway, I hope he'll make up his mind by reading this... – Bozhidar Stoyneff Jun 14 '16 at 20:25