12

I need to store 4000 fixed-size (8-character) strings in C#, but I do not know which structure is best in terms of space and the time needed to add and retrieve items: a Bloom filter, a hash table, or a Dictionary? Any help would be appreciated.

Ed S.
Duaa
  • 2
    Have you considered a simple `HashSet`? Also, if you want an answer that is *most* appropriate for your situation, you should provide more information. Is it a set of strings, or is each string-key associated with a value? Do you have any *specific* space / time requirements? What are the operations that will be performed on the collection? Any thread-safety requirements? Should it be immutable? Does it require any enumeration order? – Ani Jan 11 '11 at 01:20
  • 12
    I'd be surprised if you can retrieve the values from a bloom filter, that's for sure. – Chris Dennett Jan 11 '11 at 01:21
  • You cannot use Bloom filters to retrieve items, they simply indicate (with high probability) if an item exists in your set and for certain if an item does not exist in your set. Are you really talking about a Bloom filter + Set vs. HashTable alone comparison? – Mark Elliot Jan 11 '11 at 01:22
  • Thanks for your reply. To give you the needed details: I just want a structure to test membership, i.e. whether an item exists or not. Sorry that I wrote "retrieve"; that was a mistake. I only need to store the 4000 strings, without any associated value, and test whether an item exists without retrieving it. My strings are hex only, such as 25AC7B2A. So please, can you tell me which structure gives a membership test with minimum space and time, without retrieving the item? Sorry again for my mistake, and thanks a lot. – Duaa Jan 11 '11 at 02:22
  • There is also the trie. It is reported to be 1750x faster than a HashSet when searching for a text pattern, but inserts are slower, so it is *only a good fit for pattern searches with few or no inserts*. HashSet wins when searching for exact text. https://visualstudiomagazine.com/Articles/2015/10/20/~/media/ECG/visualstudiomagazine/Images/2015/10/1015vsm_CastanoFig8.ashx – bh_earth0 Dec 07 '17 at 18:03

3 Answers

36

In this question, you really only have two data structures to choose from, since a Dictionary in C# is implemented using a hash table. So we'll treat Dictionary and Hashtable as both being hash tables. If you use one of them, then you probably want Dictionary for its type safety and performance, as covered here: Why is Dictionary preferred over hashtable? But since a Dictionary is itself a hash table, it's not a huge difference either way.

But the real question is hash table (Dictionary) versus Bloom filter. Someone has previously asked the related question, What is the advantage to using bloom filters? They also link to the Wikipedia page on Bloom filters, which is quite informative: https://en.wikipedia.org/wiki/Bloom_filter The short version of the answer is that Bloom filters are smaller and faster. They do, however, have a cost associated with this: they are not completely accurate.

In a hash table, the original string is always stored for exact comparison. First you hash the value, and this tells you where in the table to look. Once you've looked in the table, you check the value located there against the value you're searching for. In a Bloom filter, you use multiple hashes to calculate a set of bit positions. If there are 1's in all of those positions, then you consider the string to be found. This means that sometimes strings will be "found" which were never inserted. In fact, if the table is too small, you could reach a saturation point where any string you tried would appear to be in the Bloom filter. Because you know how many strings you are going to insert, you can size the table appropriately to avoid this.
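To make the "multiple hashes, check that every position is 1" description concrete, here is a minimal sketch of a Bloom filter in C#. The class name `SimpleBloomFilter`, the FNV-1a-style hash, and the double-hashing trick used to derive several positions from two hashes are all illustrative assumptions of mine, not something the framework provides or the answer prescribes.

```csharp
using System;
using System.Collections;

// Hypothetical minimal Bloom filter: a bit array plus k derived hash positions.
public sealed class SimpleBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public SimpleBloomFilter(int bitCount, int hashCount)
    {
        if (bitCount <= 0) throw new ArgumentOutOfRangeException("bitCount");
        if (hashCount <= 0) throw new ArgumentOutOfRangeException("hashCount");
        _bits = new BitArray(bitCount);   // all bits start as 0
        _hashCount = hashCount;
    }

    // Set the bit at each of the k positions derived from the item.
    public void Add(string item)
    {
        for (int i = 0; i < _hashCount; i++)
            _bits[Position(item, i)] = true;
    }

    // False means "definitely not inserted"; true means "probably inserted".
    public bool MightContain(string item)
    {
        for (int i = 0; i < _hashCount; i++)
            if (!_bits[Position(item, i)]) return false;
        return true;
    }

    // Double hashing (an assumption): position_i = (h1 + i * h2) mod bitCount.
    private int Position(string item, int i)
    {
        unchecked
        {
            uint h1 = Hash(item, 0u);
            uint h2 = Hash(item, 0x9E3779B9u) | 1u;   // force h2 to be odd, hence non-zero
            return (int)((h1 + (uint)i * h2) % (uint)_bits.Length);
        }
    }

    // FNV-1a-style hash applied per UTF-16 code unit, with a seed mixed into the basis.
    private static uint Hash(string s, uint seed)
    {
        unchecked
        {
            uint hash = 2166136261u ^ seed;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619u;
            }
            return hash;
        }
    }
}
```

For about 4000 strings, something like `new SimpleBloomFilter(40000, 7)` would be in the range of the 1% sizing discussed in this answer. Note that the filter never stores the strings themselves, which is where both the space saving and the false positives come from.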

Let's look at the sizes involved. To make the numbers come out cleanly, I'm going to pretend that you have exactly 4096 strings. To have a relatively low-collision hash table, you would want your table to have at least as many slots as there are strings. So, realistically (assuming 32-bit, i.e. 4-byte, pointers), you'd be looking at 4096*4 bytes = 16K for the table, plus 4096*(4+4+8) bytes = 64K for the list nodes (a 4-byte next pointer and a 4-byte string pointer each) and the 8-byte strings. So, in total, probably about 80K, which isn't very much memory in most situations where you would be using C#.

For Bloom filters, we have to decide the error rate we want to aim for in our size calculations. A 1% error rate means that out of every 100 strings which were not inserted into the Bloom filter, 1 would be falsely indicated as being present. Strings which were inserted will always be correctly indicated as having been inserted. Using the equation m = -n*ln(p)/(ln(2)^2), we can calculate the minimum size that gives us a certain error rate. In that equation, m is the number of bits in the table, p is the error rate, and n is the number of strings to be inserted. So, if we set p to 0.01 (1% error), we get approximately 9.6*4096 bits = 9.6*512 bytes = 4.8K, which is obviously quite a bit smaller. But, really, 1% is kind of high for an error rate. So, more realistically, we should probably go for something like 0.0001%, which comes out to 28.8*4096 bits = 28.8*512 bytes = 14.4K. Obviously, either of those is substantially smaller than the 80K we calculated for the hash table. However, the hash table has an error rate of 0, which is clearly less than either 1% or 0.0001%.
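The sizing arithmetic can be checked with a few lines of C#. This is only a sketch: the class and method names are made up for illustration, and the hash-count formula k = (m/n)*ln(2) is the standard companion to the size formula rather than something stated in this answer.

```csharp
using System;

public static class BloomSizing
{
    // m = -n * ln(p) / (ln 2)^2, rounded up to a whole number of bits.
    public static long BitsNeeded(long n, double p)
    {
        double ln2 = Math.Log(2);
        return (long)Math.Ceiling(-n * Math.Log(p) / (ln2 * ln2));
    }

    // k = (m / n) * ln 2, rounded to the nearest integer (assumed formula, see lead-in).
    public static int OptimalHashCount(long m, long n)
    {
        return Math.Max(1, (int)Math.Round((double)m / n * Math.Log(2)));
    }

    public static void Main()
    {
        const long n = 4096;
        foreach (double p in new[] { 0.01, 0.000001 })   // 1% and 0.0001%
        {
            long m = BitsNeeded(n, p);
            Console.WriteLine("p = {0}: {1} bits, about {2:F1} KB, k = {3}",
                p, m, m / 8.0 / 1024.0, OptimalHashCount(m, n));
        }
    }
}
```

With n = 4096 this gives roughly 4.8 KB for p = 0.01 and roughly 14.4 KB for p = 0.000001, in line with the figures above.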

So, really, it's up to you whether, in your situation, the trade-off of losing some accuracy to save a little space and a little time is worthwhile. Realistically, either option is likely to be small enough and fast enough for the vast majority of real-world situations.

Community
Keith Irwin
  • @Duaa Here's a question about the advantages of Bloom filters versus hash functions: http://stackoverflow.com/questions/4282375/what-is-the-advantage-to-using-bloom-filters It also contains a link to the wikipedia page about Bloom Filters which may be helpful in making your decision. https://secure.wikimedia.org/wikipedia/en/wiki/Bloom_filter – Keith Irwin Jan 11 '11 at 07:29
  • @Duaa I've amended the answer to better meet the correction to the question you've shared. – Keith Irwin Jan 11 '11 at 08:03
  • @KeithIrwin, So isn't a bloom filter exactly the same as a hashtable without "actual" storage? – Pacerier Aug 14 '14 at 15:15
  • It's similar. The biggest difference is that a Bloom filter uses more than one hash function so that each value hits more than one "space". – Keith Irwin Nov 19 '14 at 08:35
  • @KeithIrwin, But what if *`k`* is **1**? What's the difference between a *`k=1`* bloomfilter vs a hashtable that doesn't store its values? Basically, is bloomfilter just another fancy term for a hashtable that doesn't store its values? – Pacerier Feb 09 '15 at 04:50
  • 2
    A k=1 bloom filter is the same as a hashtable which doesn't store its values. But, no, "bloomfilter" is not just a fancy term for a hashtable that doesn't store its values, because there's no reason to use k=1. It's not an efficient choice in any non-trivial case. A bloom filter is a structure which can recognize values but doesn't store them. A hash table which doesn't store values is also a structure which recognizes values but doesn't store them. But saying that bloom filters are hash tables is about like saying that a bookcase is a table because bookcases with only one shelf are like tables. – Keith Irwin Feb 09 '15 at 07:40
3

A dictionary is an abstract data type that represents a mapping from one type to another. It doesn't specify what the implementation of the dictionary is - it could be backed by a hash table, a balanced binary search tree, a skip list, or one of many other structures. It's probably not appropriate here, because a dictionary associates elements of one type with elements of some other type, and you're only concerned with storing the elements themselves.

A Bloom filter is a probabilistic data structure that is good for checking whether or not an element is definitely not in a set, but cannot tell you for sure whether something is in the set. It's commonly used in distributed systems to avoid unnecessary network reads. Each computer can store a Bloom filter of what entries might be in a database, and can filter out obviously unnecessary network calls by not querying a remote system if an entry is ruled out by the filter. It's not very good for what you're trying to do, since the false positives are probably a deal-breaker.

The hash table, though, is a great data structure for what you want. It supports fast lookups and insertions of elements and, with a good implementation, can be extremely memory efficient. However, it doesn't store the elements in sorted order, which may be a problem depending on your application.
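As a rough sketch of hash-based membership testing in C#: `HashSet<T>` stores only the elements, with no associated value. Since the asker's comments say the strings are 8-digit hex values such as 25AC7B2A, one option (my assumption, not something the answers propose) is to parse each string into a 32-bit `uint` and store that instead of the string. The class and method names below are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class MembershipDemo
{
    // Parse each 8-digit hex string into a uint and collect the values in a set.
    public static HashSet<uint> BuildSet(IEnumerable<string> hexStrings)
    {
        var set = new HashSet<uint>();
        foreach (string s in hexStrings)
            set.Add(uint.Parse(s, NumberStyles.HexNumber, CultureInfo.InvariantCulture));
        return set;
    }

    // Exact membership test: no false positives and no false negatives.
    // Input that is not valid hex is simply reported as absent.
    public static bool Contains(HashSet<uint> set, string hex)
    {
        uint value;
        return uint.TryParse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out value)
            && set.Contains(value);
    }

    public static void Main()
    {
        var set = BuildSet(new[] { "25AC7B2A", "0000FFFF" });
        Console.WriteLine(Contains(set, "25AC7B2A"));
        Console.WriteLine(Contains(set, "DEADBEEF"));
    }
}
```

One consequence of parsing is that "25ac7b2a" and "25AC7B2A" map to the same value; if case matters, a plain `HashSet<string>` with `StringComparer.Ordinal` keeps the strings as they are.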

If you do want sorted order, there are two other structures you might want to consider. The first would be a balanced binary search tree, which supports fast lookup and deletion and stores elements in sorted order. There are many good implementations out there; virtually all good programming languages ship with an implementation. The other is the trie, which supports very fast lookup and access and maintains sorted ordering. It can be a bit space-inefficient depending on the distribution of your strings, but might be exactly what you're looking for.
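If sorted order does matter, .NET 4.0 and later ship `SortedSet<T>`, a tree-based set that keeps its elements ordered. The following is a minimal sketch; the class name and sample values are illustrative only.

```csharp
using System;
using System.Collections.Generic;

public static class SortedDemo
{
    // Build a set that keeps its strings in ordinal (character code) order.
    public static SortedSet<string> Build(IEnumerable<string> items)
    {
        return new SortedSet<string>(items, StringComparer.Ordinal);
    }

    public static void Main()
    {
        SortedSet<string> set = Build(new[] { "FFFF0000", "25AC7B2A", "0000FFFF" });

        Console.WriteLine(set.Contains("25AC7B2A"));   // membership test
        Console.WriteLine(set.Min);                    // smallest element
        Console.WriteLine(string.Join(",", set));      // enumerates in sorted order
    }
}
```

Lookups in a tree take logarithmic rather than constant time, so this is only worth choosing over a hash-based set when the ordering is actually needed.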

Hope this helps!

templatetypedef
  • 1
    He asked about C# in particular. Although your description of Dictionary is correct in general, in C# it is implemented with a particular data structure and that structure is a hash table. – Keith Irwin Jan 11 '11 at 01:30
  • @Keith Irwin- Ah, I didn't recognize that. I'm not a C# person. :-) Thanks for pointing this out; I'll be sure to remember this in the future. – templatetypedef Jan 11 '11 at 01:31
1

System.Collections.Hashtable, which dates back to .NET 1.0, does essentially the same job as System.Collections.Generic.Dictionary, which was introduced in .NET 2.0.

I would suggest using Dictionary, since it is type safe: you specify the key type and the value type. Hashtable only works with the object type, so you have to cast the result back to a string every time you retrieve data.
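The difference in type safety looks roughly like this. It is a small sketch with made-up class and method names; only `Hashtable`, `Dictionary<TKey, TValue>`, and `TryGetValue` come from the framework.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class TypeSafetyDemo
{
    // Hashtable stores plain objects, so the caller must cast on the way out.
    public static string FromHashtable(Hashtable table, string key)
    {
        return (string)table[key];   // null if the key is absent; throws if the value is not a string
    }

    // Dictionary<string, string> is checked at compile time, so no cast is needed.
    public static string FromDictionary(Dictionary<string, string> dictionary, string key)
    {
        string value;
        return dictionary.TryGetValue(key, out value) ? value : null;
    }

    public static void Main()
    {
        var table = new Hashtable();
        table["25AC7B2A"] = "first";

        var dictionary = new Dictionary<string, string>();
        dictionary["25AC7B2A"] = "first";

        Console.WriteLine(FromHashtable(table, "25AC7B2A"));
        Console.WriteLine(FromDictionary(dictionary, "25AC7B2A"));
    }
}
```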

dsum
  • 1
    Hi, if you only need to test whether an item is a member of a structure, the best choice is System.Collections.Generic.HashSet<T> (which ships in the System.Core assembly). It is fast because it is hash-based, and it prevents duplicate data in the set. It is smaller than a Dictionary because it stores only the elements, with no separate value attached to each key. – dsum Jan 12 '11 at 05:10