
I'm looking for an optimized algorithm that, given an array (or list) of a struct that I wrote, removes the duplicate elements and returns the result.
I know I can do it with a simple O(n^2) algorithm, but I want a better one.

Any help will be appreciated.

Zhr Saghaie

4 Answers


This runs in close to O(N) time:

var result = items.Distinct().ToList();

[EDIT]

Since Microsoft doesn't document that it runs in O(N) time, I did some timings with the following code:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        private void run()
        {
            test(1000);
            test(10000);
            test(100000);
        }

        private void test(int n)
        {
            var items = Enumerable.Range(0, n);
            new Action(() => items.Distinct().Count())
                .TimeThis("Distinct() with n == " + n + ": ", 10000);
        }

        static void Main()
        {
            new Program().run();
        }
    }

    static class DemoUtil
    {
        public static void TimeThis(this Action action, string title, int count = 1)
        {
            var sw = Stopwatch.StartNew();

            for (int i = 0; i < count; ++i)
                action();

            Console.WriteLine("Calling {0} {1} times took {2}", title, count, sw.Elapsed);
        }
    }
}

The results are:

Calling Distinct() with n == 1000:   10000 times took 00:00:00.5008792
Calling Distinct() with n == 10000:  10000 times took 00:00:06.1388296
Calling Distinct() with n == 100000: 10000 times took 00:00:58.5542259

The times are increasing approximately linearly with n, at least for this particular test, which indicates that an O(N) algorithm is being used.

Matthew Watson

For practical use, LINQ's Distinct() is the simplest solution. It uses a hashtable-based approach, probably very similar to the following algorithm.

If you're interested in what such an algorithm might look like:

IEnumerable<T> Distinct<T>(IEnumerable<T> sequence)
{
    var alreadySeen = new HashSet<T>();
    foreach (T item in sequence)
    {
        if (alreadySeen.Add(item)) // Add returns false if the item was already in the set
            yield return item;
    }
}

If there are d distinct elements and n total elements then this algorithm will take O(d) memory and O(n) time.

Since this algorithm uses a hash set, it requires well-distributed hashes to achieve O(n) runtime. If the hashes are bad, the runtime can degenerate to O(n*d).

CodesInChaos

You can sort the array in O(N log N) time and then compare adjacent elements in a single pass to remove the duplicates.

Aravind

You can use a HashSet<T>, which gives O(N) complexity:

List<int> RemoveDuplicates(List<int> input)
{
    var result = new HashSet<int>(input);
    return result.ToList();
}

But it will increase memory usage.

odinmillion