9

I stumbled upon the following problem.
I want a HashSet with all numbers from 1 to 100,000,000. I tried the following code:

var mySet = new HashSet<int>();
for (var k = 1; k <= 100000000; k++)
     mySet.Add(k);

That code didn't make it: I got a memory overflow (an OutOfMemoryException) somewhere around 49 million. It was also pretty slow, and memory use grew excessively.

Then I tried this.

var mySet = Enumerable.Range(1, 100000000).ToHashSet();

where ToHashSet() is the following code:

public static HashSet<T> ToHashSet<T>(this IEnumerable<T> source)
{
    return new HashSet<T>(source);
}

I got a memory overflow again, but I was able to put in more numbers than with the previous code.

The thing that does work is the following:

var tempList = new List<int>();
for (var k = 1; k <= 100000000; k++)
     tempList.Add(k);

var numbers = tempList.ToHashSet();

It takes about 800 ms on my system just to fill the tempList, whereas the Enumerable.Range() call only takes 4 ticks!

I do need that HashSet, or else it would take too much time to look up values (I need the lookup to be O(1)), and it would be great if I could build it the fastest way.

Now my question is:
Why do the first two methods cause a memory overflow while the third doesn't?

Is there something special HashSet does with memory on initialization?

My system has 16GB of memory, so I was quite surprised when I got the overflow exceptions.

Mixxiphoid
  • 4
    One thing to note is that `Enumerable.Range` is so quick because it doesn't actually generate anything when you run it. It's only when it is used (i.e. in the `ToHashSet` call) that it actually starts generating numbers. – Chris Jul 19 '12 at 09:02
  • @Chris Didn't know that. Thanks :). – Mixxiphoid Jul 19 '12 at 09:18
  • It's the same with all the LINQ-type enumerable stuff. If you did a Where on an enumerable, or a Select, or any number of other things that basically return more IEnumerables, it will defer their execution until they are used. It's useful to know this since you can have a few gotchas due to this behaviour (though offhand I can't think of a concise example; see the sketch after these comments). – Chris Jul 19 '12 at 09:28
  • You might want to see [why-cant-i-preallocate-a-hashsett-c-sharp](http://stackoverflow.com/questions/6771917/why-cant-i-preallocate-a-hashsett-c-sharp) also – nawfal May 25 '14 at 10:56
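
A concise sketch of the deferred execution Chris describes (the timing code and variable names here are illustrative, not from the question):

// Requires: using System; using System.Collections.Generic;
//           using System.Diagnostics; using System.Linq;
var sw = Stopwatch.StartNew();

// Returns immediately: Enumerable.Range generates nothing yet.
IEnumerable<int> range = Enumerable.Range(1, 100000000);
Console.WriteLine("Range created after {0} ms", sw.ElapsedMilliseconds);

// Only now is the sequence enumerated and the real work done.
var mySet = new HashSet<int>(range);
Console.WriteLine("Set built after {0} ms ({1} items)", sw.ElapsedMilliseconds, mySet.Count);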

4 Answers

10

Like other collection types, the HashSet will automatically increase its capacity as required as you add elements. When adding a large number of elements, this will result in a large number of reallocations.

If you initialize it with a constructor that takes an IEnumerable<T>, it will check if the IEnumerable<T> is in fact an ICollection<T>, and if so, initialize the HashSet's capacity to the size of the collection.

This is what's happening in your third example: you're adding a List<T>, which is also an ICollection<T>, so your HashSet is given an initial capacity equal to the size of the list, thus ensuring that no reallocations are needed.

You will be even more efficient if you use the List<T> constructor that takes a capacity parameter, as this will avoid reallocations when building the list:

var noElements = 100000000;
var tempList = new List<int>(noElements); 
for (var k = 1; k <= noElements; k++) 
     tempList.Add(k); 

var numbers = tempList.ToHashSet(); 

As for your system memory: check whether this is a 32-bit or 64-bit process. A 32-bit process has a maximum of 2GB of memory available (3GB if you've used the /3GB startup switch).

Unlike other collection types (e.g. List<T>, Dictionary<TKey,TValue>), HashSet<T> doesn't have a constructor that takes a capacity parameter to set the initial capacity. If you want to initialize a HashSet<T> with a large number of elements, the most efficient way to do so is probably to first add the elements to an array or List<T> with the appropriate capacity, then pass this array or list to the HashSet<T> constructor.
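
For illustration, a sketch of that array-based variant (the variable names are mine, the sizes are the question's). An int[] implements ICollection<int>, so the HashSet<T> constructor can read its Count and size itself once:

var noElements = 100000000;
var buffer = new int[noElements];   // one exact-size allocation
for (var k = 0; k < noElements; k++)
    buffer[k] = k + 1;

// buffer is an ICollection<int>, so the set is sized up front; no reallocations.
var numbers = new HashSet<int>(buffer);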

Joe
  • So when the HashSet is reallocating memory, is it actually ditching the old memory and using a completely new set, thus leaving the old memory floating around in limbo until GC or something? Otherwise I can understand why this would be faster, but not why it prevents out-of-memory exceptions... – Chris Jul 19 '12 at 09:00
  • 1
    @Chris, exactly, the old memory is eligible for GC when it's discarded, but probably the GC hasn't kicked in yet. – Joe Jul 19 '12 at 09:04
  • The application is an x64 application. I now see why it is indeed more efficient to first set the capacity. I didn't know that ICollection behaved like that! Thanks a lot. – Mixxiphoid Jul 19 '12 at 09:15
  • HashSet nowadays has an initial capacity parameter. It looks like it was introduced in .NET 4.7.2 (around 4 years after this question was asked): https://stackoverflow.com/a/6771986/10728554 https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1.-ctor?view=net-6.0#System_Collections_Generic_HashSet_1__ctor_System_Int32_ – mastef Nov 19 '21 at 03:45
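
On runtimes where that constructor exists (.NET 4.7.2 and later, per the comment above), the whole workaround collapses to a minimal sketch like this:

// Assumes .NET 4.7.2+, where HashSet<T>(int capacity) is available.
var mySet = new HashSet<int>(100000000);  // pre-sized once, no doubling
for (var k = 1; k <= 100000000; k++)
    mySet.Add(k);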
2

I guess HashSet<T>, like most .NET collections, uses the array-doubling strategy for growth. Unfortunately there are no constructor overloads that take a capacity.

But since it checks for ICollection<T> and uses ICollection<T>.Count as the initial capacity, you can implement a rudimentary ICollection<T> that only meaningfully implements GetEnumerator() and Count. That way you can fill the HashSet<T> directly without materializing a temporary List<T>.
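
A rough sketch of what that could look like, assuming the HashSet<T> constructor only ever touches Count and GetEnumerator() (the type name RangeCollection is illustrative; the mutating members can simply throw):

using System;
using System.Collections;
using System.Collections.Generic;

public sealed class RangeCollection : ICollection<int>
{
    private readonly int _start, _count;

    public RangeCollection(int start, int count)
    {
        _start = start;
        _count = count;
    }

    // The two members the HashSet<T> constructor actually needs:
    public int Count { get { return _count; } }

    public IEnumerator<int> GetEnumerator()
    {
        for (var i = 0; i < _count; i++)
            yield return _start + i;
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // The rest of ICollection<int> is never called while constructing the set:
    public bool IsReadOnly { get { return true; } }
    public bool Contains(int item) { return item >= _start && item < _start + _count; }
    public void CopyTo(int[] array, int arrayIndex)
    {
        for (var i = 0; i < _count; i++)
            array[arrayIndex + i] = _start + i;
    }
    public void Add(int item) { throw new NotSupportedException(); }
    public void Clear() { throw new NotSupportedException(); }
    public bool Remove(int item) { throw new NotSupportedException(); }
}

// Usage: the set is sized once from Count, no temporary List<int>:
// var numbers = new HashSet<int>(new RangeCollection(1, 100000000));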

CodesInChaos
1

If you put 100 million ints into a HashSet, that consumes about 1.5GB (on my machine). If you instead create a bool[100000000] and set the entry for each number you've seen to true, it takes only 100MB and also looks up faster than a HashSet. This assumes the ints range from 0 to 100,000,000.
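
A minimal sketch of that approach, assuming (as stated) every value falls in the range 0 to 100,000,000:

const int max = 100000000;
var seen = new bool[max + 1];   // roughly 100MB: one byte per flag

seen[42] = true;                // mark a number as present

var contains42 = seen[42];      // O(1) lookup with no hashing involved

If memory matters more than raw speed, a System.Collections.BitArray of the same length stores one bit per number, about 12.5MB, at the cost of a little bit twiddling per access.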

IvoTops
  • The lookup speed of a HashSet is O(1); how can the bool array be faster than that? – Mixxiphoid Jul 22 '12 at 06:49
  • 2
    Direct array lookup is also O(1), but calculating a hash and getting data from a bucket is more expensive than looking up an entry in an array. And the use of 15 times more memory (probably because the hashset wraps all ints to objects) is also not a negligible difference. – IvoTops Jul 22 '12 at 20:41
  • Thanks for the elaboration. I will have to change my code quite a bit if I implement it, but I will surely try. Thanks for the suggestion. – Mixxiphoid Jul 23 '12 at 07:00
0

HashSet grows by doubling, and each grow allocates a new backing array roughly twice the size of the old one while the old one is still reachable, so the grow itself can exceed available memory well before the final set would.

On a 64-bit system a HashSet can hold upwards of 89 million items.

On a 32-bit system the limit is about 61.7 million items.

That's why you are getting the memory overflow exception.

For more info:

http://blog.mischel.com/2008/04/09/hashset-limitations/

Massimiliano Peluso
  • That's not true. I actually do have a HashSet with 100 mil items, and that's on an x64 platform/application. – Mixxiphoid Jul 19 '12 at 09:24
  • Can you clarify what you mean here? The final solution that works from the OP seems to be putting 100 million items in. Are the above figures talking about how long until you run into memory limitations via the doubling strategy? – Chris Jul 19 '12 at 09:26
  • Ah, sorry, I was misunderstanding your answer. That is indeed true for adding the items in a loop (and therefore triggering the doubling). – Mixxiphoid Jul 19 '12 at 09:37