1

I have over 600k lines of strings. I want to group identical strings and get the count of each.

For example:

i go to school
i like music
i like games
i like music
i like music
i like games
i like music

So the result would be:

i go to school , 1
i like games , 2
i like music , 4

How can I do that in the fastest possible way?

ekad
Furkan Gözükara

4 Answers

5

The GroupBy method is what you want. Your strings need to be in a list or anything else that implements IEnumerable<string>. File.ReadLines, suggested by spender, returns an IEnumerable<string> that reads the file line by line.

var stringGroups = File.ReadLines("filename.txt").GroupBy(s => s);
foreach (var stringGroup in stringGroups)
    Console.WriteLine("{0} , {1}", stringGroup.Key, stringGroup.Count());

If you want them ordered from least to most frequent (as in your example), just add an OrderBy:

...
foreach (var stringGroup in stringGroups.OrderBy(g => g.Count()))
    ...
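
As a minimal, self-contained sketch of the two snippets above combined: the class name `LineCounter` and the method name `CountLines` are hypothetical, introduced only for illustration. The method takes any IEnumerable<string>, so it can be fed either an in-memory list or File.ReadLines.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class LineCounter
{
    // Groups identical lines and returns (line, count) pairs,
    // ordered from least to most frequent.
    public static List<KeyValuePair<string, int>> CountLines(IEnumerable<string> lines)
    {
        return lines
            .GroupBy(s => s)                                               // one group per distinct line
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))  // count each group once
            .OrderBy(p => p.Value)                                         // least frequent first
            .ToList();
    }
}
```

For a file, this could be called as `LineCounter.CountLines(File.ReadLines("filename.txt"))` (with `using System.IO;`) and each pair printed with `Console.WriteLine("{0} , {1}", pair.Key, pair.Value)`. Projecting to a pair before sorting means each group is counted once rather than again when printing.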
Ray
  • Yes, I can also read line by line, but in that case what would be the best approach? – Furkan Gözükara Jan 12 '12 at 11:21
  • How are the lines stored? In a file? File.ReadLines is your friend. It returns an `IEnumerable` that you can use for LINQ statements, but does not load the entire file into memory. http://msdn.microsoft.com/en-us/library/dd383503.aspx – spender Jan 12 '12 at 11:27
  • @Ray from .NET 4 your method for reading line by line is redundant. See my previous comment. – spender Jan 12 '12 at 11:28
  • @spender, sweet! I actually tried my code and it didn't work anyway. Will add your suggestion. – Ray Jan 12 '12 at 11:29
3

You can use LINQ to implement it:

IEnumerable<string> stringSource = File.ReadLines("C:\\file.txt");

var result = stringSource
    .GroupBy(str => str)
    .Select(group => new {Value = group.Key, Count = group.Count()})
    .OrderBy(item => item.Count)
    .ToList();

foreach(var item in result)
{
    // item.Value - string value
    // item.Count - count
}
Viacheslav Smityukh
2

Another, "old-school" approach is to iterate over all lines and add each one to a Dictionary if it is not already present, or increment its count if it is. The key is the line and the value is the count.

var d = new Dictionary<string, Int32>();
foreach (var line in File.ReadAllLines(@"C:\Temp\FileName.txt"))
     if (d.ContainsKey(line)) d[line]++; else d.Add(line, 1);

The advantage is that this also works on earlier framework versions.
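
A variant of the same idea, sketched under the assumption that a single dictionary lookup per line is preferable to ContainsKey followed by the indexer: the names `DictionaryLineCounter` and `Count` are hypothetical, and the method accepts any IEnumerable<string> so the source of the lines does not matter.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class DictionaryLineCounter
{
    // Returns a dictionary mapping each distinct line to its number of occurrences.
    public static Dictionary<string, int> Count(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            int current;
            // TryGetValue sets current to 0 (default(int)) when the key is missing.
            counts.TryGetValue(line, out current);
            counts[line] = current + 1;
        }
        return counts;
    }
}
```

It could be called as `DictionaryLineCounter.Count(File.ReadAllLines(@"C:\Temp\FileName.txt"))`. Unlike the GroupBy version, the dictionary gives no ordering guarantee, so sort the entries separately if the output order matters.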

Tim Schmelter
  • Are you sure that using ReadAllLines for "I have got over 600k lines of string" is a good idea? – Viacheslav Smityukh Jan 12 '12 at 11:31
  • @Viacheslav Smityukh: The question was not IO related; the OP only said that he *has* 600k lines. My ReadAllLines is only an example of how to get them from the file system. I could easily remove that line from my answer. – Tim Schmelter Jan 12 '12 at 11:33
  • Ray's solution works great and is extremely fast :) But thanks for the answer. – Furkan Gözükara Jan 12 '12 at 11:33
  • @Viacheslav Smityukh: Anyway, changed it to use [ReadLines](http://msdn.microsoft.com/en-us/library/dd383503.aspx#Y1365), which might be faster than [ReadAllLines](http://msdn.microsoft.com/en-us/library/system.io.file.readalllines.aspx). But it is a 4.0 method and only faster when you start enumerating the collection of strings before the whole collection is returned. – Tim Schmelter Jan 12 '12 at 11:40
  • GroupBy uses `Lookup` which uses a HashTable anyway. http://stackoverflow.com/questions/8775395/lookup-class-in-linq-what-is-the-underlying-data-structure – Ray Jan 12 '12 at 11:42
  • Bear in mind that changing it to use ReadLines makes it incompatible with versions of .NET before 4.0. – Chris Dunaway Jan 12 '12 at 17:57
  • @Chris: You are right and I've changed the code again to use `ReadAllLines` ;) – Tim Schmelter Jan 12 '12 at 18:09
  • @MonsterMMORPG: By the way, this approach also takes only 200 ms to group and count 700k lines into a Dictionary. – Tim Schmelter Jan 12 '12 at 18:48
2

You can try this:


var groupedLines = System.IO.File.ReadAllLines(@"C:\temp\samplelines.txt").GroupBy(x=>x);
groupedLines.ToList().ForEach(y => Console.WriteLine("Content: {0} - Occurrences: {1}", y.Key, y.Count()));

Giorgio Minardi