I am having trouble extracting only the lines that are not duplicated, and only the lines that are duplicated, from a test file. The input file contains both duplicated and non-duplicated lines.

I have created a logging function, and I can extract all unique lines to a separate file, but that output contains one copy of every line, whether it was duplicated or not. I need to separate the two groups.

This is what I have so far:

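static void Dupes(string path1, string path2)
{
    string log = "log.txt"; // log file name (not used yet)

    // 'using' disposes the reader and writer even if an exception is thrown.
    using (var sr = new StreamReader(File.OpenRead(path1)))
    using (var sw = new StreamWriter(File.OpenWrite(path2)))
    {
        // Hash codes of the lines that have already been written.
        var lines = new HashSet<int>();
        while (!sr.EndOfStream)
        {
            string line = sr.ReadLine();
            int hc = line.GetHashCode();
            if (lines.Contains(hc))
                continue; // seen before, skip it

            lines.Add(hc);
            sw.WriteLine(line);
        }
    }
}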

Ideally this would be in two functions, so they can be called to perform different actions on the output contents.

John Saunders
Chimpin

  • Any reason you can't read all the lines into memory first? – Jonesopolis Dec 23 '14 at 16:06
  • How big is the input file? – Steve Dec 23 '14 at 16:07
  • @Steve's question is relevant. If you can load the file into memory, Jonesopolis's approach is great. If memory is an issue, your idea of using hashes is good, but it can lead to hash collisions, which is bad (a hash algorithm with a larger output size may be preferable). To implement it, though, you have to read the file twice: first to build a dictionary of 'line hash' to 'line count', then a second time to look up in the dictionary whether each line occurs once or more than once. – Orace Dec 23 '14 at 16:32
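
The two-pass idea from the last comment could look roughly like the sketch below. It is only an illustration under some assumptions: the names `TwoPassSplitter`, `Split`, and `Digest` are hypothetical, SHA-256 stands in for "a hash algorithm with a larger output size", and the dictionary of digests is assumed to fit in memory even when the file itself does not.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class TwoPassSplitter
{
    // Hypothetical helper: SHA-256 digest of a line, Base64-encoded for use as a key.
    static string Digest(SHA256 sha, string line)
    {
        return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(line)));
    }

    public static void Split(string inputPath, string uniquePath, string dupePath)
    {
        var counts = new Dictionary<string, int>(StringComparer.Ordinal);
        using (SHA256 sha = SHA256.Create())
        {
            // Pass 1: count occurrences per digest. File.ReadLines streams the file lazily.
            foreach (string line in File.ReadLines(inputPath))
            {
                string key = Digest(sha, line);
                int n;
                counts.TryGetValue(key, out n); // n stays 0 for a new key
                counts[key] = n + 1;
            }

            // Pass 2: re-read the file and route each line by its count.
            using (var uniques = new StreamWriter(uniquePath))
            using (var dupes = new StreamWriter(dupePath))
            {
                foreach (string line in File.ReadLines(inputPath))
                {
                    string key = Digest(sha, line);
                    if (counts[key] == 1)
                    {
                        uniques.WriteLine(line);
                    }
                    else if (counts[key] > 1)
                    {
                        dupes.WriteLine(line);
                        counts[key] = 0; // mark as written so each duplicated line appears once
                    }
                }
            }
        }
    }
}
```

Calling `TwoPassSplitter.Split(path1, "unique.txt", "dupes.txt")` would write the once-only lines to one file and one copy of each repeated line to the other; the output file names here are placeholders.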

2 Answers

Use LINQ to group the lines, then check the count of each group:

var lines = File.ReadAllLines(path1);

var distincts = lines.GroupBy(l => l)
                    .Where(l => l.Count() == 1)
                    .Select(l => l.Key)
                    .ToList();

var dupes = lines.Except(distincts).ToList();

It's worth noting that Except returns a set difference, so each duplicated line appears only once in dupes. That's something I just learned, and it means there's no need to call Distinct afterwards.
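
Since the question asks for two separate functions, the grouping idea could be wrapped up roughly as below. This is a sketch only: `LineFilters`, `UniqueLines`, and `DuplicatedLines` are made-up names, and it assumes the whole input fits in memory.

```csharp
using System.Collections.Generic;
using System.Linq;

static class LineFilters
{
    // Lines that occur exactly once, in order of first appearance.
    public static List<string> UniqueLines(IEnumerable<string> lines)
    {
        return lines.GroupBy(l => l)
                    .Where(g => g.Count() == 1)
                    .Select(g => g.Key)
                    .ToList();
    }

    // Lines that occur more than once, each reported a single time.
    public static List<string> DuplicatedLines(IEnumerable<string> lines)
    {
        return lines.GroupBy(l => l)
                    .Where(g => g.Count() > 1)
                    .Select(g => g.Key)
                    .ToList();
    }
}
```

Either result can then be written out independently, for example with `File.WriteAllLines(path2, LineFilters.UniqueLines(File.ReadLines(path1)))`.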

Jonesopolis

You can do it as follows:

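var lines = File.ReadAllLines(path1);

// Pair each line with the number of times it occurs in the file.
var countLines = lines.Select(d => new
{
    Line = d,
    Count = lines.Count(f => f == d),
});

// Lines that occur exactly once.
var UniqueLines = countLines.Where(d => d.Count == 1).Select(d => d.Line);
// Lines that occur more than once (one entry per occurrence).
var NotUniqueLines = countLines.Where(d => d.Count > 1).Select(d => d.Line);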
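
One thing to be aware of: `lines.Count(f => f == d)` rescans the whole array for every line, so the work grows quadratically with the file size. A possible variation, sketched here with the hypothetical names `CountedLines` and `WithCount`, counts each distinct line once up front and then filters, keeping every occurrence just as the query above does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CountedLines
{
    // Returns every occurrence of each line whose total count satisfies 'keep'.
    public static List<string> WithCount(string[] lines, Func<int, bool> keep)
    {
        // One pass to count each distinct line.
        Dictionary<string, int> counts = lines
            .GroupBy(l => l)
            .ToDictionary(g => g.Key, g => g.Count());

        // One more pass to filter, using constant-time lookups.
        return lines.Where(l => keep(counts[l])).ToList();
    }
}
```

With this, `CountedLines.WithCount(lines, c => c == 1)` plays the role of UniqueLines and `CountedLines.WithCount(lines, c => c > 1)` the role of NotUniqueLines.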

Patrick