45

My C# program generates random strings from a given pattern. These strings are stored in a list. As no duplicates are allowed I'm doing it like this:

List<string> myList = new List<string>();
for (int i = 0; i < total; i++) {
  string random_string = GetRandomString(pattern);
  if (!myList.Contains(random_string)) myList.Add(random_string);
}

As you can imagine this works fine for several hundreds of entries. But I'm facing the situation to generate several million strings. And with each added string checking for duplicates gets slower and slower.

Are there any faster ways to avoid duplicates?

Robert Strauch
  • 12,055
  • 24
  • 120
  • 192
  • use set for avoiding duplicates – Jayram Jun 24 '13 at 14:58
  • 1
    would it be faster too add them all, then use Distinct() to check for duplicates, then add back the number that were removed? – Jonesopolis Jun 24 '13 at 14:59
  • 1
    @Jonesy: That sounds like something worth testing for a particular data set. If it does turn out to be faster then one would weigh that performance optimization against the obfuscation it adds to the code (which isn't much in this case). – David Jun 24 '13 at 15:01
  • Just out of interest, what exactly are you using these for? – musefan Jun 24 '13 at 15:02
  • 5
    @David I'd likely make the theoretical argument that `HashSet` would be faster because of less memory impact initially and no need to iterate fully afterwards. The cost of checking each item still exists, but that data structure is optimised for it. – Adam Houldsworth Jun 24 '13 at 15:02
  • @musefan I need those to generate serial numbers for documents. – Robert Strauch Jun 24 '13 at 15:13
  • 1
    @Robert Could you use a `GUID` for each document? – Servy Jun 24 '13 at 15:16
  • If you are persisting your list to a database, you could also try making the field unique and then if the INSERT fails you can try a different one - just something else to consider – musefan Jun 24 '13 at 15:22
  • @Servy No unfortunately. The pattern is kind of special so GUIDs won't help. – Robert Strauch Jun 24 '13 at 15:23
  • @musefan Doing an entire DB round trip just to find out that the string already exists would be... a problem. – Servy Jun 24 '13 at 15:24
  • @Servy: Depends how likely it is to conflict. If the program is having to load the List from the DB in the first place, it might be an acceptable trade-off. – musefan Jun 24 '13 at 15:26
  • 1
    @musefan Making even a single DB query to determine if an item already exists in the DB would take longer than hundreds of thousands, if not millions of checks to see if an item exists in a hashset in memory. Using a DB to solve this particular problem could easily be a several thousand times slowdown. – Servy Jun 24 '13 at 15:29
  • @Servy: Fair enough, you are likely correct, it certainly sounds logical anyway – musefan Jun 24 '13 at 15:32

7 Answers7

64

Use a data structure that can much more efficiently determine if an item exists, namely a HashSet. It can determine if an item is in the set in constant time, regardless of the number of items in the set.

If you really need the items in a List instead, or you need the items in the resulting list to be in the order they were generated, then you can store the data in both a list and a hashset; adding the item to both collections if it doesn't currently exist in the HashSet.

Servy
  • 202,030
  • 26
  • 332
  • 449
  • 1
    Ok, so I used a `HashSet` and the increase in speed is enormous. However I do have a new problem. I need a certain amount of entries in the hash set. If I use for for-loop as in my question then it stops after 2,000,000 cycles. Duplicates don't exist in the hash set but if a duplicate is hit the hash set does not have 2,000,000 entries. How could I avoid that? `if (myList.Count < 2000000) myList.Add(random_string);` prevents this but is, again, kind of slow. – Robert Strauch Jun 24 '13 at 19:04
  • 3
    @Robert Rather than `for(int i = 0; i < numItems; i++)` just use `for(int i = 0; set.Count < numItems; i++)`. Or, if you don't actually need `i` at all, then just `while(set.Count < numItems)`. – Servy Jun 24 '13 at 19:10
  • it seems,that finding item for HasSet is O(1), so if you find this item=add it to doublicate list. – user2545071 Aug 24 '15 at 13:10
17

The easiest way is to use this:

myList = myList.Distinct().ToList();

Although this would require creating the list once, then creating a new list. A better way might be to make your generator ahead of time:

public IEnumerable<string> GetRandomStrings(int total, string pattern)
{
    for (int i = 0; i < total; i++) 
    {
        yield return GetRandomString(pattern);
    }
}

...

myList = GetRandomStrings(total, pattern).Distinct().ToList();

Of course, if you don't need to access items by index, you could probably improve efficiency even more by dropping the ToList and just using an IEnumerable.

p.s.w.g
  • 146,324
  • 30
  • 291
  • 331
  • 4
    Using `.Distinct` to remove several million strings in a list doesn't sound that efficient IMO. – Darren Jun 24 '13 at 15:00
  • @DarrenDavies Internally, `Distinct` uses a `HashSet`, just as others have suggested. The only inefficient part of is generating the list first, then using distinct, which I've addressed in the second part of my answer. – p.s.w.g Jun 24 '13 at 15:05
  • 1
    @p.s.w.g I assume your `GetRandomStrings` method meant to `yield` the string, not just set it to a local then throw it away. – Servy Jun 24 '13 at 15:09
  • Also, if there are some number of string that you need in the result it might make sense to have `GetRandomStrings` generate an infinitely long sequence and then use `Take` to limit it to the desired size. You could then put the `Take` either before or after the `Distinct`, depending on whether you want to specify the number of strings generated or the number of *unique* strings generated. – Servy Jun 24 '13 at 15:11
  • @Servy I had initially implemented it like that, but infinite generators can be dangerous, and need to be handled with some care. – p.s.w.g Jun 24 '13 at 15:14
  • One issue with `Distinct`, if I read the documentation correctly, is that the order of statements is not preserved. If the order that items were inserted is important, then a data structure consisting of a `List` and (hidden) `HashSet`, as @Servy suggested, is a better solution. – AlainD Nov 23 '21 at 10:40
10

Don't use List<>. Use Dictionary<> or HashSet<> instead!

catfood
  • 4,267
  • 5
  • 29
  • 55
10

You could use a HashSet<string> if order is not important:

HashSet<string> myHashSet = new HashSet<string>();
for (int i = 0; i < total; i++) 
{
   string random_string = GetRandomString(pattern);
   myHashSet.Add(random_string);
}

The HashSet class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

MSDN

Or if the order is important, I'd recommend using a SortedSet (.net 4.5 only)

DGibbs
  • 14,316
  • 7
  • 44
  • 83
  • 3
    Note that the `SortedSet` sorts the elements. If a ordered set is required (i.e. the element order is maintained)`OrderedDictionary` would be a better choice. The drawback is that it is not generic. – Olivier Jacot-Descombes Jun 24 '13 at 15:17
  • How do I get the hashed object then? HashSet does NOT have a GET neither is it very efficient to implement your self. – Piotr Kula Aug 23 '16 at 13:35
2

not a good way but kind of quick fix, take a bool to check if in whole list there is any duplicate entry.

bool containsKey;
string newKey;

    public void addKey(string newKey){

         foreach(string key in MyKeys){
           if(key == newKey){
             containsKey = true;
          }
         }

      if(!containsKey){
       MyKeys.add(newKey);
     }else{
       containsKey = false;
     }

    }
Amir Javed
  • 19
  • 2
0

A Hashtable would be a faster way to check if an item exists than a list.

Z .
  • 12,657
  • 1
  • 31
  • 56
  • 2
    He doesn't have a key/value relationship, just a bunch of strings, so he needs a set not a map. Also, HashTable is not generic; you should be using the generic `Dictionary` instead if you really do need a map structure. You shouldn't ever be using a HashTable in non legacy code. – Servy Jun 24 '13 at 15:00
0

Have you tried:

myList = myList.Distinct()
jdehlin
  • 10,909
  • 3
  • 22
  • 33