
I am searching for the best-performing method to group, count, and sort sequences using LINQ. I will be processing files even bigger than 500 MB, so performance is the most important factor in this task.

List<int[]> num2 = new List<int[]>
{
    new int[] { 35, 44 },
    new int[] { 200, 22 },
    new int[] { 35, 33 },
    new int[] { 35, 44 },
    new int[] { 3967, 11 },
    new int[] { 200, 22 },
    new int[] { 200, 2 },
};

The result has to be like this:

[35,   44] => 2
[200,  22] => 2
[35,   33] => 1
[3967, 11] => 1
[200,  2 ] => 1

I have done something like this:

Dictionary<int[], int> result2 = (from i in num2
                                  group i by i into g
                                  orderby g.Count() descending
                                  select new { Key = g.Key, Freq = g.Count() })
                                 .ToDictionary(x => x.Key, x => x.Freq);

SetRichTextBox("\n\n Second grouping\n");

foreach (var i in result2)
{
    SetRichTextBox("\nKey: ");
    foreach (var r in i.Key)
    {
        SetRichTextBox(r.ToString() + "  ");
    }

    SetRichTextBox("\n  Value: " + i.Value.ToString());
}

But it is not working properly. Any help?

maszynaz
  • What isn't working? What error are you getting? – Andrew Coonce Oct 23 '13 at 20:10
  • `to group and count sequences` — what group? What count? Your example doesn't show anything meaningful for what you said; the question is poorly explained. – King King Oct 23 '13 at 20:12
  • Are the arrays always of length 2? – Rob Lyndon Oct 23 '13 at 20:17
  • Each int array should be printed once, along with how many times it occurred in the list. Each array has the same size, but not always 2. – maszynaz Oct 23 '13 at 20:18
  • Why do you use a List, and how do you get it? If you build it yourself while reading the file, then use a Dictionary directly. That approach saves you the group-by step; you will only need to sort the dictionary by values. – Shad Oct 23 '13 at 20:39
  • Shad I will check your idea. – maszynaz Oct 23 '13 at 20:57
  • You'll need to define a hashcode for your key, which may be a way to go. There isn't a default one for int[] because it's expensive, but you can use a custom type of your own that implements its own hashcode. – Rob Lyndon Oct 24 '13 at 14:49
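The last two comments can be combined into one sketch: a custom IEqualityComparer<int[]> (so the dictionary compares array contents rather than references) passed to a Dictionary that counts rows as they are read, skipping the group-by entirely. This is a minimal, hypothetical illustration, not code from the thread; the class and variable names are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical structural comparer for int[] keys; name is illustrative.
class IntArrayComparer : IEqualityComparer<int[]>
{
    public bool Equals(int[] x, int[] y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null || x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (x[i] != y[i]) return false;
        return true;
    }

    public int GetHashCode(int[] arr)
    {
        unchecked
        {
            int hash = 17;
            foreach (int v in arr)
                hash = hash * 31 + v;   // standard 31-multiplier rolling hash
            return hash;
        }
    }
}

class Program
{
    static void Main()
    {
        // Count rows directly while "reading" them, as Shad suggests,
        // instead of materialising a List<int[]> first.
        var counts = new Dictionary<int[], int>(new IntArrayComparer());
        var rows = new[] { new[] { 35, 44 }, new[] { 200, 22 }, new[] { 35, 44 } };
        foreach (var row in rows)
            counts[row] = counts.TryGetValue(row, out var c) ? c + 1 : 1;

        foreach (var kv in counts.OrderByDescending(kv => kv.Value))
            Console.WriteLine("[{0}] => {1}", string.Join(", ", kv.Key), kv.Value);
    }
}
```

Because the comparer works on contents, the two distinct `new[] { 35, 44 }` instances land in the same bucket; sorting the dictionary by value afterwards is the only remaining pass.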

2 Answers


For arrays of length 2, this will work.

num2.GroupBy(a => a[0])
    .Select(g => new { A0 = g.Key, A1 = g.GroupBy(a => a[1]) })
    .SelectMany(a => a.A1.Select(a1 => new { Pair = new int[] { a.A0, a1.Key }, Count = a1.Count() }));

I think that should give you optimal performance; you could also try an .AsParallel() clause after your first Select statement.

This strategy (grouping successively by the n-th element of the arrays) generalises to arrays of arbitrary length:

var dim = 2;

var tuples = num2.GroupBy(a => a[0])
    .Select(g => new Tuple<int[], List<int[]>>(new [] { g.Count(), g.Key }, g.Select(a => a.Skip(1).ToArray()).ToList()));

for (int n = 1; n < dim; n++)
{
    tuples = tuples.SelectMany(t => t.Item2.GroupBy(list => list[0])
        .Select(g => new Tuple<int[], List<int[]>>(new[] { g.Count() }.Concat(t.Item1.Skip(1)).Concat(new [] { g.Key }).ToArray(), g.Select(a => a.Skip(1).ToArray()).ToList())));
}

var output = tuples.Select(t => new { Arr = string.Join(",", t.Item1.Skip(1)), Count = t.Item1[0] })
    .OrderByDescending(o => o.Count)
    .ToList();

which generates an output of

Arr = "35, 44", Count = 2
Arr = "200, 22", Count = 2
Arr = "35, 33", Count = 1
Arr = "200, 2", Count = 1
Arr = "3967, 11", Count = 1

in your example. I'll let you test it for higher dimensions. :)

You should be able to parallelise these queries without too much difficulty, as the successive groupings are independent.
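For comparison, a simpler route to arrays of arbitrary length (at the cost of one string allocation per row) is to group by a joined string key, where AsParallel drops in naturally. This is a hedged sketch of an alternative technique, not code from the answer above:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var num2 = new List<int[]>
        {
            new[] { 35, 44 }, new[] { 200, 22 }, new[] { 35, 33 },
            new[] { 35, 44 }, new[] { 3967, 11 }, new[] { 200, 22 },
            new[] { 200, 2 },
        };

        // string.Join gives structural equality for any array length;
        // AsParallel spreads the hashing and grouping across cores.
        var counts = num2
            .AsParallel()
            .GroupBy(a => string.Join(",", a))
            .Select(g => new { g.Key, Count = g.Count() })
            .OrderByDescending(x => x.Count)
            .ToList();

        foreach (var x in counts)
            Console.WriteLine("[{0}] => {1}", x.Key, x.Count);
    }
}
```

Whether the string keys beat successive numeric groupings on a 500 MB input is something worth benchmarking; the string version is easier to get right, the numeric one avoids the per-row allocation.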

Rob Lyndon
  • Hi, how about variable int arrays? – maszynaz Oct 23 '13 at 20:43
  • OK, there should be a way to do that. If your array size is n >= 1, you can perform an elegant Aggregate operation. Give me a few minutes. – Rob Lyndon Oct 23 '13 at 20:49
  • Done. OK, it was more than a few minutes. You could quite easily make an Aggregate out of this, but I don't think we need to make it any more complicated than it already is. – Rob Lyndon Oct 23 '13 at 21:47
  • Its performance isn't bad, but it could be better. For better performance, you need to find a good hashing algorithm for an array of ints. – Rob Lyndon Oct 26 '13 at 18:42

You can do something like this:

var results = from x in nums
              group x by new { a = x[0], b = x[1] } into g
              orderby g.Count() descending
              select new
              {
                  Key = g.Key,
                  Count = g.Count()
              };

foreach (var result in results)
    Console.WriteLine("[{0},{1}]=>{2}", result.Key.a, result.Key.b, result.Count);

The trick is to come up with a way to compare the values in the array, instead of the arrays themselves.

The alternative (and possibly better option) would be to transform your data from int[] to some custom type, override the equality operator on that custom type, then just group x by x into g, but if you're really stuck with int[] then this works.
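A minimal sketch of that alternative, assuming a hypothetical wrapper type (the name IntRow is invented for this illustration) that overrides Equals and GetHashCode so a plain group x by x works for arrays of any length:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper giving int[] value semantics; the name IntRow
// is invented for this sketch.
sealed class IntRow
{
    public int[] Values { get; }
    public IntRow(int[] values) { Values = values; }

    public override bool Equals(object obj) =>
        obj is IntRow other && Values.SequenceEqual(other.Values);

    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            foreach (int v in Values)
                hash = hash * 31 + v;
            return hash;
        }
    }
}

class Program
{
    static void Main()
    {
        var nums = new List<int[]> { new[] { 35, 44 }, new[] { 35, 44 }, new[] { 200, 2 } };

        var results = from x in nums
                      group x by new IntRow(x) into g   // value equality, any length
                      orderby g.Count() descending
                      select new { g.Key, Count = g.Count() };

        foreach (var r in results)
            Console.WriteLine("[{0}] => {1}", string.Join(",", r.Key.Values), r.Count);
    }
}
```

GroupBy only needs the two overrides: GetHashCode to bucket the keys and Equals to confirm matches within a bucket.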

Andrew Coonce
  • The problem is, this doesn't generalise to arrays of arbitrary length, as OP requested in his comments. – Rob Lyndon Oct 23 '13 at 22:11
  • @RobLyndon: That's why I provided the alternative solution, which is far more robust. The assumption that the data is always a `List` of equal-length `int[]` has a bad smell, and this problem becomes trivial once you move the data into any custom format whatsoever. – Andrew Coonce Oct 24 '13 at 13:56