I noticed when trying to code a CLR procedure for SQL Server that `HashSet<T>` is not allowed because it is attributed with `[HostProtectionAttribute(SecurityAction.LinkDemand, MayLeakOnAbort = true)]`. SQL Server CLR procedures do not allow the use of objects where `MayLeakOnAbort` is set. Okay, so there are some classes to avoid in CLR procedures, and maybe even to think twice about using outside of CLR procedures. The strange thing is that `Dictionary<K,V>` is not similarly restricted. Based on my understanding of what a HashSet is and what a Dictionary is, I would expect a Dictionary to have all the complexity of a HashSet and then some. Why is it, then, that Dictionary is not similarly restricted?

I'm doing my "think twice about using `HashSet<T>`" and seriously considering using a Dictionary instead, even though I'm not writing a CLR procedure and need nothing more than a collection that can be quickly tested for membership of a complex key (an object reference for an object that has no comparison, hashing or equality interfaces defined). Am I better off using a HashSet or a Dictionary? Is HashSet different in that it allows classes with no comparison or equality interfaces, based purely on memory addresses or something, which might be why a HashSet is less "clean"?

2 Answers
`HashSet<T>` contains methods such as `IntersectWith` that are implemented with unsafe code using `stackalloc`. `Dictionary<TKey, TValue>` does not contain any such methods. While it's possible to mark your own assembly as unsafe, and avoid the risky methods, I've simply given up and used `Dictionary<T, bool>` in SQL CLR functions, where all values are `true`, for precisely this reason.
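As a rough illustration of that workaround, here is a minimal sketch of a set built on `Dictionary<T, bool>`. The `DictionarySet<T>` class and its members are hypothetical names made up for this example, not framework types, and its `IntersectWith` is a plain managed loop rather than the optimized `HashSet<T>` version. Whether a wrapper like this is accepted by a particular SQL Server permission set is an assumption you would need to verify yourself.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the .NET Framework): a set backed by
// Dictionary<T, bool>, where every stored value is simply true.
// Unlike HashSet<T>, it cannot hold null, because Dictionary rejects null keys.
public sealed class DictionarySet<T>
{
    private readonly Dictionary<T, bool> _items = new Dictionary<T, bool>();

    public int Count { get { return _items.Count; } }

    // Returns true if the item was newly added, false if it was already present.
    public bool Add(T item)
    {
        if (_items.ContainsKey(item)) return false;
        _items[item] = true;
        return true;
    }

    public bool Contains(T item) { return _items.ContainsKey(item); }

    public bool Remove(T item) { return _items.Remove(item); }

    // Keeps only the items that also occur in 'other', using safe managed code.
    public void IntersectWith(IEnumerable<T> other)
    {
        // Index 'other' so each membership test is a hash lookup.
        var lookup = new Dictionary<T, bool>();
        foreach (T item in other) lookup[item] = true;

        // Collect first, then remove: a Dictionary must not be modified
        // while it is being enumerated.
        var toRemove = new List<T>();
        foreach (T key in _items.Keys)
            if (!lookup.ContainsKey(key)) toRemove.Add(key);
        foreach (T key in toRemove) _items.Remove(key);
    }
}

public static class DictionarySetDemo
{
    public static void Main()
    {
        var set = new DictionarySet<string>();
        set.Add("a");
        set.Add("b");
        set.Add("c");
        set.IntersectWith(new[] { "b", "c", "d" });

        Console.WriteLine(set.Count);         // 2
        Console.WriteLine(set.Contains("a")); // False
    }
}
```

The `bool` values are never read; only the keys carry information, which is the extra memory cost that comes with this approach.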
-
Still, why can it leak memory? – usr Apr 17 '14 at 19:00
-
@usr Because thread abort exceptions don't work properly when invoking unmanaged code, and may not be given the opportunity to clean up those unmanaged resources. – Servy Apr 17 '14 at 19:01
-
@Servy Actually, although it's unsafe code, it's still managed code, and I'm surprised that `stackalloc` would be a problem. – Apr 17 '14 at 19:04
-
And for my case where I'm developing code outside of CLR, should I just use HashSet and avoid certain methods, or use a Dictionary? I guess I'm most concerned about unnecessary overhead because my set is not really that large. How do I decide which class is appropriate? – BlueMonkMN Apr 17 '14 at 19:05
-
@BlueMonkMN When you're not dealing with SQL Server's restrictions, it's probably best to forget all about them, and if a hash set best describes the data structure you'd like to have, use `HashSet<T>`. – Apr 17 '14 at 19:08
-
Someone (well, Microsoft really) ought to make a "safe" HashSet class that excludes or reimplements the problem methods more safely... preferably replacing the existing HashSet. – BlueMonkMN Apr 17 '14 at 19:10
-
@BlueMonkMN What is the big deal with just using a Dictionary and not using the value? It excludes the unsafe code. – paparazzo Apr 17 '14 at 19:12
-
Not a big deal, just a pain for those discovering this problem for the first time. Many of them probably don't even know that Dictionary is a good alternative and wouldn't suspect it because a Dictionary would seem more complex and less safe. Kind of counter-intuitive. It seems silly (and confusing to other developers who have to read the code) to introduce extra complexity to get around something so simple. Certainly I wouldn't go too far out of my way, but I suspect HashSet could have been implemented without this drawback. – BlueMonkMN Apr 17 '14 at 19:17
-
@BlueMonkMN At the very least (in hindsight), `IntersectWith` and others could have been implemented as extension methods, so that `HashSet<T>` itself would be safe, only `HashSetExtensions` wouldn't be. – Apr 17 '14 at 19:27
-
@hvd They wouldn't be able to access the underlying data structures of the type if they were extension methods, which would preclude the very performance optimizations being brought up here. – Servy Apr 17 '14 at 19:33
-
@Servy I don't think there's such a thing as inaccessible data in managed code. You can always retrieve even private members via reflection. – BlueMonkMN Apr 17 '14 at 19:35
-
@BlueMonkMN If you're breaking out reflection then you're *definitely* not going to have positive performance gains. Reflection adds some pretty major performance costs. Even sticking to the public API would likely be better than using reflection. – Servy Apr 17 '14 at 19:36
-
@Servy Nevertheless, reflection must be what Entity Framework and/or LINQ-to-SQL relies upon heavily for much of its object tracking. And just because you use reflection to get at the data initially doesn't mean you have to constantly go through that interface, once you have access to the data, you should be able to hold on to a reference and access it as efficiently as any non-reflected member. It's not like you're recreating HashSets all the time; if you are, that would by far be your bottleneck, not the reflection. – BlueMonkMN Apr 17 '14 at 19:39
-
@Servy As long as the fields that are currently private would be made internal, my hypothetical `HashSetExtensions` class could do exactly what the `HashSet` methods do now, couldn't it? – Apr 17 '14 at 19:40
-
@BlueMonkMN With EF you're comparing the work to network interactions, which of course is going to dwarf even reflection in performance costs. Pulling out a handful of fields just the once can be a pretty significant cost, from the point of view of a method like this which innately ought to be reasonably fast. – Servy Apr 17 '14 at 19:42
-
I would have thought that constructing a whole HashSet instance would also dwarf the cost of reflection. But the point is moot. We don't need reflection per hvd. – BlueMonkMN Apr 17 '14 at 20:20
Dictionary is based on HashTable, rather than HashSet. While they are conceptually very similar, the implementation of HashSet includes some unsafe methods, whereas HashTable and Dictionary do not.
Dictionary uses a HashTable primarily as a means of speeding searches of the keyspace. Given an efficient implementation of GetHashCode() on the type used for your dictionary key, lookups in a dictionary are best-case constant time and worst-case linear time.
The HashSet is a collection for storing unique values only (no keying mechanism), and requires a proper implementation of GetHashCode on your class to function properly.
HashTables and Dictionaries are used for looking up values by a key. HashSets are used solely for maintaining a set of unique objects and do not have a keying mechanism.
If you don't need a uniqueness guarantee, or the other functions provided by something that implements ISet, there's no real reason to use a HashSet instead of an array or list.
If you need the ability to get your items out of the collection by a key, use a HashTable or Dictionary (Dictionary is preferred, since it is generic-aware and thus you're not constantly boxing/unboxing everything).
See these links for explanations:
http://msdn.microsoft.com/en-us/library/bb397727(v=vs.110).aspx
http://msdn.microsoft.com/en-us/library/4yh14awz(v=vs.110).aspx
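For the use case in the question (membership by object reference only), a short sketch may help. It assumes a hypothetical `Widget` class that overrides neither `Equals` nor `GetHashCode`; in that situation both `HashSet<T>` and `Dictionary<TKey, TValue>` fall back to the default `object` implementations, which compare references for classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration: it overrides neither Equals nor
// GetHashCode and implements no comparison or equality interfaces.
public sealed class Widget
{
    public readonly string Name;
    public Widget(string name) { Name = name; }
}

public static class ReferenceIdentityDemo
{
    public static void Main()
    {
        var a = new Widget("same");
        var b = new Widget("same"); // same contents, different reference

        var set = new HashSet<Widget>();
        set.Add(a);

        var seen = new Dictionary<Widget, bool>();
        seen[a] = true;

        // Both collections treat 'a' and 'b' as distinct objects.
        Console.WriteLine(set.Contains(a));     // True
        Console.WriteLine(set.Contains(b));     // False
        Console.WriteLine(seen.ContainsKey(a)); // True
        Console.WriteLine(seen.ContainsKey(b)); // False
    }
}
```

So for reference-identity membership the two collections behave the same way; the choice between them comes down to intent and, inside SQL CLR, to the host protection restriction discussed above.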

-
This is mostly wrong; the differences between the dictionary and the set class are actually quite small. – Daniel Brückner Apr 17 '14 at 18:58
-
Dictionary is a collection of key/value pairs. Set and HashSet are just collections of values. They're fundamentally different for most use cases. – dodexahedron Apr 17 '14 at 18:58
-
A Dictionary, at a conceptual level, is just a `HashSet<KeyValuePair<TKey, TValue>>`, with a few tweaks to the public API. The underlying implementations are going to be the same basic algorithm, even if there are a few differences here or there. – Servy Apr 17 '14 at 19:00
-
They're not interchangeable types and shouldn't be used as such. (Yes, you can make a dictionary that has keys and all null values, but that IS a HashSet with extra wasted memory.) If all you need is a collection of unique objects without a key, use a HashSet. If you need to retrieve a specific object by a key, use a Dictionary. If you need an indexed, orderable collection of objects in which there may be duplicates, use a list or array. The underlying implementation is not terribly important to the question if the use case doesn't jibe with the class being used. – dodexahedron Apr 17 '14 at 19:09
-
The question is about an implementation detail and therefore the implementation matters. And in this case both implementations are highly similar, which then raises the question of why only one of the implementations may leak. – Daniel Brückner Apr 17 '14 at 19:16
-
I suppose I worded that last sentence extremely poorly. The point is, a `Dictionary<TKey, TValue>` is NOT a `HashSet<KeyValuePair<TKey, TValue>>`, or else Dictionary would have the same problems. Thus, they are NOT the same underlying implementation, which is the entire point (and the first sentence) of my answer. Whether they are the same, conceptually, is immaterial to whether they are the same in actuality (which even you concede they are not). – dodexahedron Apr 17 '14 at 19:19
-
@dodexahedron You made the claim that they have radically different implementations. They do not. You also are pretty much ignoring the question itself, which is asking why a problem exists for `HashSet` and not `Dictionary`, which one would imagine ought to have a comparable implementation. Your answer of "they have radically different underlying algorithms" is just flat wrong. – Servy Apr 17 '14 at 19:20
-
@dodexahedron So the question remains, *how* are they different, and *why* is it that the dictionary does not have the same limitation as the hash set. You don't answer that question. – Servy Apr 17 '14 at 19:21
-
@Servy - You are correct. I should not have said "entirely different" (and will change that wording so as to avoid confusion by future people). But the statement that they have a different inheritance hierarchy is true, and, without diving into the specific implementation, does answer the question. But a bigger problem is the use case the OP stated is not addressed by the accepted answer. In fact, a HashSet is WRONG, for the use case he specified, if he hasn't properly defined GetHashCode on the elements of the set. – dodexahedron Apr 17 '14 at 19:26
-
@dodexahedron He is interested in having the reference as the identity of the object, in which case the default object `Equals`/`GetHashCode` is fine. Again, your answer in no way indicates why this restriction exists in a hash set and not in a dictionary, which is the question. It goes off explaining conceptual differences between the two constructs, which the OP clearly understands. If he didn't, he wouldn't feel wary about using a `Dictionary` to maintain a set of data in the first place. – Servy Apr 17 '14 at 19:32
-
The OP made the observation that he cannot use `HashSet<T>` within SQL CLR but he can use `Dictionary<TKey, TValue>`. Now he wonders why this is the case, because he reasonably assumed that you could easily implement a dictionary on top of a set and therefore the same limitations should apply to both classes, or at least the dictionary should have the stronger limitations. As it turns out, the difference lies in optimized implementations of the set operations in `HashSet<T>`, and without these optimizations it would be possible to use `HashSet<T>` within SQL CLR, too. – Daniel Brückner Apr 17 '14 at 19:33
-
@DanielBrückner - Right. Hence why pointing out that they're not implemented on top of each other is a valid (though yes, less precise) response. I don't dispute that the other points that have been brought up are better than my original answer. I'm just saying it's a logical fallacy to assume that, just because something CAN be done one way, that it HAS been done that way (especially in programming). Hence me pointing out they have different lineages. Whether or not Microsoft could have done this better is an entirely different (and interesting) discussion. – dodexahedron Apr 17 '14 at 19:38
-
The point of the question is that I *do* need a uniqueness guarantee (and in my case I only care about unique references, so I have not implemented a GetHashCode function; if the references are different, the entries/keys are different). Do Dictionaries and HashSets both handle this, and is one better than the other at doing so if I don't care about any associated values? HashSet is the obvious answer, but seems to have a hidden downside, and that hidden downside is what I'm concerned with. – BlueMonkMN Apr 17 '14 at 20:29
-
In that case, then yes - Dictionary is the answer, and still has O(1) lookup time, as yes, it uses a hashtable to store the values. You may want to look into ConcurrentDictionary, as its usage semantics are more similar to a c++ Map (accessing by key creates the key, rather than throwing a KeyNotFoundException). It also has O(1) search time, but is a thread-safe collection. – dodexahedron Apr 17 '14 at 20:50
-
FYI: `ConcurrentDictionary` is incompatible with CLR stored procedures (in SAFE mode, at least). – Branko Dimitrijevic Dec 02 '14 at 19:21