54

Why can't I preallocate a HashSet<T>?

There are times when I might be adding a lot of elements to it, and I want to eliminate resizing.

John Saunders
EpiX
  • 1
    The short answer is "coz". Coz simply that's how MS decided to write it in their infinite wisdom. Doesn't help you, but that's the way things are in this case. However, if you have a predefined collection of items that you want to drop into it upon creation, then use `new HashSet(items)` – Will Jul 21 '11 at 06:12
  • No, because despite the List having 9999 empty slots, it has none used, and it's only the used ones that are copied into the HashSet. If you were able to build your data beforehand into a list or similar, you could then dump it into a HashSet in one easy step. Unfortunately, what you are after doesn't exist in HashSets. Perhaps a Dictionary instead, at the cost of having an unused value for each entry. – Will Jul 21 '11 at 06:19
  • Or use the non-Generic, non-typesafe, Hashtable, which supports capacity. (I think I just threw up a little in my mouth) – Will Jul 21 '11 at 06:22
  • You could just implement your own type safe HashTable. Would be pretty straight forward. – brendan Jul 21 '11 at 06:33
  • 2
    This is changing with .NET 4.7.2. It adds a new constructor that takes capacity argument. – nawfal Mar 20 '18 at 06:02

5 Answers

33

Answer below was written in 2011. It's now in .NET 4.7.2 and .NET Core 2.0; it will be in .NET Standard 2.1.


There's no technical reason why this shouldn't be possible - Microsoft just hasn't chosen to expose a constructor with an initial capacity.

If you can call a constructor which takes an IEnumerable<T> and use an implementation of ICollection<T>, I believe that will use the size of the collection as the initial minimum capacity. This is an implementation detail, mind you. The capacity only has to be large enough to store all the distinct elements...

EDIT: I believe that if the capacity turns out to be way larger than it needs to be, the constructor will trim the excess when it's finished finding out how many distinct elements there really are.

Anyway, if you have the collection you're going to add to the HashSet<T> and it implements ICollection<T>, then passing it to the constructor instead of adding the elements one by one is going to be a win, basically :)
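
As a minimal sketch of that one-step approach (the class and method names here are hypothetical, chosen only for illustration), the data is collected in a List<T> first and then handed to the constructor. How the constructor uses the collection's Count internally remains an implementation detail.

```csharp
using System;
using System.Collections.Generic;

public static class FromCollectionExample
{
    // Hypothetical helper: builds the set in one step from data already in a List<T>.
    public static HashSet<int> Build(List<int> items)
    {
        // List<T> implements ICollection<T>, so the constructor can read items.Count
        // before copying (what it does with that count is an implementation detail).
        return new HashSet<int>(items);
    }

    public static void Main()
    {
        var items = new List<int> { 1, 2, 3, 3, 4 };
        HashSet<int> set = Build(items);
        Console.WriteLine(set.Count); // 4: the duplicate 3 is stored only once
    }
}
```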

EDIT: One workaround would be to use a Dictionary<TKey, TValue> instead of a HashSet<T>, and just not use the values. That won't work in all cases though, as it won't give you the same interface as HashSet<T>.
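
A rough sketch of the dictionary workaround, assuming a bool placeholder is an acceptable unused value (the names DictionaryAsSetExample and BuildKeySet are made up for this example). Dictionary<TKey, TValue> does have a constructor that takes an initial capacity, which is the whole point of the trick.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryAsSetExample
{
    // Hypothetical helper: uses only the keys of a dictionary as a set.
    public static Dictionary<int, bool> BuildKeySet(int capacity, IEnumerable<int> items)
    {
        // The capacity constructor reserves room up front.
        var seen = new Dictionary<int, bool>(capacity);
        foreach (int item in items)
        {
            // The indexer overwrites existing keys, so duplicates are harmless.
            seen[item] = true;
        }
        return seen;
    }

    public static void Main()
    {
        Dictionary<int, bool> seen = BuildKeySet(10000, new[] { 5, 5, 7 });
        Console.WriteLine(seen.Count);          // 2
        Console.WriteLine(seen.ContainsKey(7)); // True
    }
}
```

Membership checks go through ContainsKey rather than Contains, and set operations such as UnionWith are not available, which is the interface difference mentioned above.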

Jon Skeet
  • 1
    Just checked with reflector. You're right about `ICollection` – Ivan Danilov Jul 21 '11 at 06:15
  • 3
    It will use the collection's size as the initial capacity, but will trim the excess to no more than about 3 times the size of the actual number of distinct elements. So if you pass in an uninitialized array with 1M elements, at first it will create a large array internally, but after discovering that there is only 1 unique element it will resize its internal array to 3 elements. – Gabe Jul 21 '11 at 06:18
  • So i could do something like HashSet asdf = new HashSet(new List(9999)); Thanks edit: :( – EpiX Jul 21 '11 at 06:19
  • 1
    @EpiX: If you read my comment above yours, you'll see that `HashSet asdf = new HashSet(new List(9999));` will *not* do what you want. – Gabe Jul 21 '11 at 06:22
  • @Gabe: Yup, discovered that after posting :) – Jon Skeet Jul 21 '11 at 06:35
  • Thanks for this answer - it almost contains a solution :) I've posted it below. – BartoszKP Apr 15 '14 at 16:34
  • 4
    I really wish Microsoft would make a constructor that took an int, even if it was just to make the API between built-in collections more consistent. – JPtheK9 May 04 '15 at 00:44
  • Since this answer is the top-most answer, commenting here for others' reference -- This capability was added in [4.7.2](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1.-ctor?view=netframework-4.7.2#System_Collections_Generic_HashSet_1__ctor_System_Int32_); refer to David's answer below. – sanchitkum Jul 13 '19 at 20:14
  • Gah, very annoying (and a bit strange) that this is available in net472 but *not* netstandard2.0 :/ – Cocowalla Feb 21 '20 at 10:20
11

The answer by Jon Skeet is almost a complete one. To solve this problem with HashSet<int> I had to do the following:

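using System.Collections.Generic;
using System.Linq;

public class ClassUsingHashSet
{
    // Built once and shared by all instances: 10,000 distinct values, so the
    // constructor below sizes the set for 10,000 elements and does not trim it.
    private static readonly List<int> PreallocationList
        = Enumerable.Range(0, 10000).ToList();

    private readonly HashSet<int> hashSet;

    public ClassUsingHashSet()
    {
        // Copying from an ICollection<T> allocates the capacity up front;
        // Clear() then empties the set but keeps that capacity.
        this.hashSet = new HashSet<int>(PreallocationList);
        this.hashSet.Clear();
    }

    public void Add(int item)
    {
        this.hashSet.Add(item);
    }
}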

This trick works because after Clear the HashSet is not trimmed, as described in the documentation:

The capacity remains unchanged until a call to TrimExcess is made.

BartoszKP
  • 2
    Brilliant! One can create a public static generic method to return a HashSet this way. Not sure about the cost of creating the preallocation list. – nawfal May 25 '14 at 15:34
  • @nawfal It's a one time price anyway, so could be worth it when you have many instances of `ClassUsingHashSet`. – BartoszKP May 26 '14 at 10:17
  • I get that. Btw, I take back my original comment that you can make this generic. It's not possible, at least not easily, to populate the preallocation list :) – nawfal May 26 '14 at 10:25
  • @nawfal The easy version of your idea is partially possible - with the `new` generic constraint. – BartoszKP May 26 '14 at 10:29
  • not as a generic solution. The reason is that, two `new T()`s can be equal depending on the equality desired. In that case, the HashSet will add only one item for the entire 10000 additions. Sadly, HashSet constructor will trim the excess size (if it is beyond some ratio which is a very small number). I just saw reflector. Otherwise it was very easy to achieve this, in a way faster than yours - just do `var initializer = new T[10000];` and pass it to constructor. Array initialization is dead simple. – nawfal May 26 '14 at 10:34
  • 1
    @nawfal Ah, yes... right, good point, `new` won't help here reliably. – BartoszKP May 26 '14 at 10:35
  • 1
    This is only a good solution if you're allocating an ObjectPool of hashset – Chris Marisic Sep 10 '15 at 01:30
  • @BartoszKP but it's copying every value from initial collection, doing N hash calculations and N copyings. I don't think it's more efficient than just use `HashSet` without preallocation. – Alex Zhukovskiy Feb 04 '17 at 12:34
  • @AlexZhukovskiy Comparing this whole operation with single `HashSet` creation without preallocation it's the same. The point of preallocating hash set is to avoid reallocations when you're using it later. So, comparing an ordinary insert operation vs. insert operation into a preallocated hash set the latter is better on average. – BartoszKP Feb 04 '17 at 21:20
9

I'm using this code to set the initial capacity of a HashSet. You can use it as an extension method or call it directly.

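using System.Collections.Generic;
using System.Reflection;

public static class HashSetExtensions
{
    private const BindingFlags Flags = BindingFlags.Instance | BindingFlags.NonPublic;

    // Calls HashSet<T>'s private Initialize(int) method via reflection.
    // This depends on an implementation detail and may break between versions.
    public static HashSet<T> SetCapacity<T>(this HashSet<T> hs, int capacity)
    {
        var initialize = hs.GetType().GetMethod("Initialize", Flags);
        // The object[] is the argument list: one argument, the capacity.
        initialize.Invoke(hs, new object[] { capacity });
        return hs;
    }

    public static HashSet<T> GetHashSet<T>(int capacity)
    {
        return new HashSet<T>().SetCapacity(capacity);
    }
}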

Update (04 July):

This code can also be improved by caching the reflection lookup:

using System.Collections.Generic;
using System.Reflection;

public static class HashSetExtensions
{
    // One cached MethodInfo per closed generic type HashSet<T>.
    private static class HashSetDelegateHolder<T>
    {
        private const BindingFlags Flags = BindingFlags.Instance | BindingFlags.NonPublic;
        public static MethodInfo InitializeMethod { get; } = typeof(HashSet<T>).GetMethod("Initialize", Flags);
    }

    // Still relies on the private Initialize(int) implementation detail.
    public static void SetCapacity<T>(this HashSet<T> hs, int capacity)
    {
        HashSetDelegateHolder<T>.InitializeMethod.Invoke(hs, new object[] { capacity });
    }

    public static HashSet<T> GetHashSet<T>(int capacity)
    {
        var hashSet = new HashSet<T>();
        hashSet.SetCapacity(capacity);
        return hashSet;
    }
}
Alex Zhukovskiy
  • Shouldn't this be `new object[capacity]`? Otherwise you're creating an object array containing one element (the int, capacity). – Patrick M Jul 07 '14 at 06:31
  • @PatrickM The same method is called internally when the `ICollection` constructor is called. And it clearly does `new T[capacity]` – Alex Zhukovskiy Jul 07 '14 at 08:37
  • @PatrickM so yes, it should be array with single parameter `capacity` - see method signature. – Alex Zhukovskiy Nov 25 '14 at 05:27
  • 1
    It is worth mentioning that this relies on a .NET implementation detail, and may break in any version, or even within the same version due to patches. – Aidiakapi Apr 19 '15 at 14:50
  • @Aidiakapi we work with implementation details every day: `Value types go on the stack`, `LOH size is 85k bytes`, `integer arithmetic is faster than float arithmetic` and so on. There's nothing to be done about it: if you need it, you need it. – Alex Zhukovskiy Jul 04 '16 at 11:00
  • Is the overhead of using reflection even worth the gains in most cases? – MarioDS Apr 26 '18 at 07:33
  • 1
    @MarioDS The answer is: it depends. If you're fine with the `where T : new()` constraint, which actually uses `Activator.CreateInstance`, and you didn't face any performance issues, then it's OK. Calling a single method through reflection is obviously better than several reallocations of the array. IIRC HashSet doubles its capacity to the nearest prime, so if you want to create a HashSet with 10000 items, and the starting capacity is 4, you will have ~6 reallocations with `O(n^2)` allocated memory. – Alex Zhukovskiy Apr 26 '18 at 09:51
7

This capability was added in 4.7.2:

HashSet<T>(Int32)

Initializes a new instance of the HashSet<T> class that is empty, 
but has reserved space for capacity items and uses the default 
equality comparer for the set type.
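
For illustration, a small sketch of that constructor in use, assuming a target of .NET Framework 4.7.2 or later (or .NET Core 2.0 or later). The class and method names are hypothetical; only `new HashSet<int>(capacity)` is the API described above.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityConstructorExample
{
    // Hypothetical helper: reserves space first, then adds `count` elements.
    public static HashSet<int> Fill(int count)
    {
        // HashSet<T>(Int32): an empty set with space reserved for `count` items.
        var set = new HashSet<int>(count);
        for (int i = 0; i < count; i++)
        {
            set.Add(i);
        }
        return set;
    }

    public static void Main()
    {
        HashSet<int> set = Fill(10000);
        Console.WriteLine(set.Count); // 10000
    }
}
```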
David Wohlferd
0

The only way to initialize the HashSet with an initial capacity is to construct it with an instance of a class, such as a List<T>, that implements ICollection<T>. It will call Count on the ICollection<T>, allocate enough space to hold the collection, and add all the elements to the HashSet without reallocation.

chuckj
  • 1
    It then trims if the capacity is much larger than required though, so it's not a general-purpose solution. – Jon Skeet Jul 21 '11 at 06:36