
I have an interesting problem that could be solved in a number of ways:

  • I have a function that takes in a string.
  • If this function has never seen this string before, it needs to perform some processing.
  • If the function has seen the string before, it needs to skip processing.
  • After a specified amount of time, the function should accept duplicate strings.
  • This function may be called thousands of times per second, and the string data may be very large.

This is a highly abstracted description of the real application; I'm just trying to get down to the core concept for the purpose of the question.

The function will need to store state in order to detect duplicates. It also will need to store an associated timestamp in order to expire duplicates.

It does NOT need to store the strings; a unique hash of the string would be fine, provided there are no false positives due to collisions (use a perfect hash?) and the hash function is fast enough.

The naive implementation would be simply (in C#):

 Dictionary<String,DateTime>

though in the interest of lowering the memory footprint and potentially increasing performance, I'm evaluating custom data structures to handle this instead of a basic hashtable.
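For concreteness, a minimal sketch of that naive approach (the class name, the five-minute window, and the decision to never evict old entries are placeholders, not part of the real application):

using System;
using System.Collections.Generic;

class DuplicateFilter
{
    private readonly Dictionary<string, DateTime> _seen = new Dictionary<string, DateTime>();
    private readonly TimeSpan _window = TimeSpan.FromMinutes(5); // hypothetical expiry window

    // Returns true if the caller should process this string.
    public bool ShouldProcess(string s)
    {
        DateTime last;
        if (_seen.TryGetValue(s, out last) && DateTime.UtcNow - last < _window)
            return false; // seen recently: skip processing

        _seen[s] = DateTime.UtcNow; // record (or refresh) the timestamp
        return true;
    }
}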

So, given these constraints, what would you use?

EDIT, some additional information that might change proposed implementations:

  • 99% of the strings will not be duplicates.
  • Almost all of the duplicates will arrive back to back, or nearly sequentially.
  • In the real world, the function will be called from multiple worker threads, so state management will need to be synchronized.
Jonathan Holland
  • Has the dictionary proven to not pass your desired performance metrics? – Anthony Pegram Apr 14 '12 at 04:54
  • At this point this is a conceptual question. I don't like the dictionary because I'm pointlessly storing the strings. – Jonathan Holland Apr 14 '12 at 04:56
  • You won't be able to calculate a unique hash if the strings are large. – phoog Apr 14 '12 at 04:56
  • You can't use a hash code to guarantee uniqueness, period. See http://blog.mischel.com/2012/04/13/hash-codes-are-not-unique/ – Jim Mischel Apr 14 '12 at 04:59
  • @JimMischel if the set of possible string values is small enough, each string could be guaranteed to have a unique hash. – phoog Apr 14 '12 at 05:02
  • @phoog: Yes. If you use a 32-bit hash and your strings are guaranteed to be no more than 4 bytes long, and you write a special hash function that treats them like integers. Or if you know what the strings are ahead of time and you construct a minimally perfect hash. But in general you cannot use a hash code to guarantee uniqueness for arbitrary strings. – Jim Mischel Apr 14 '12 at 05:16

4 Answers


I don't believe it is possible to construct a "perfect hash" without knowing the complete set of values first (especially in C#, where a hash code is a 32-bit int with a limited number of possible values). So any kind of hashing requires the ability to compare the original values too.

I think a dictionary is the best you can get with out-of-the-box data structures. Since you can store objects with custom comparers defined, you can easily avoid keeping the strings in memory and simply save the location where the whole string can be obtained, i.e. an object with the following values:

stringLocation.fileName = "file13.txt";
stringLocation.fromOffset = 100;
stringLocation.toOffset = 345;
stringLocation.expiration = new DateTime(2012, 9, 9, 11, 0, 0);
stringLocation.hashCode = 123456;

where a custom comparer will return the saved hashCode, or retrieve the string from the file if needed and perform the comparison.
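A minimal sketch of such a comparer, assuming a hypothetical StringLocation type matching the fields above (a real implementation should loop on Read until the buffer is full, and handle nulls):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class StringLocation
{
    public string FileName;
    public long FromOffset;
    public long ToOffset;
    public DateTime Expiration;
    public int HashCode;
}

class StringLocationComparer : IEqualityComparer<StringLocation>
{
    public int GetHashCode(StringLocation loc)
    {
        return loc.HashCode; // return the precomputed hash; no I/O needed
    }

    public bool Equals(StringLocation a, StringLocation b)
    {
        if (a.HashCode != b.HashCode)
            return false; // cheap rejection: different hashes can't be equal

        // Only on a hash match do we pay the I/O cost of comparing the text.
        return ReadString(a) == ReadString(b);
    }

    private static string ReadString(StringLocation loc)
    {
        using (var fs = File.OpenRead(loc.FileName))
        {
            fs.Seek(loc.FromOffset, SeekOrigin.Begin);
            var buffer = new byte[loc.ToOffset - loc.FromOffset];
            fs.Read(buffer, 0, buffer.Length); // sketch: assumes a full read
            return Encoding.UTF8.GetString(buffer);
        }
    }
}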

Alexei Levenkov
  • Actually strings *are* in memory, but simply are not in the `dictionary`. That's why I don't understand the OP's concern about storing them in it. – Tigran Apr 14 '12 at 05:55
  • @Tigran, non-constant strings are normal objects eligible for GC, so if a string is not referenced it may easily be collected. I.e. for `string s = "abc"; s=s+s; s= "abc";`, "abc" will likely always be in memory (since it is a constant in the code), but "abcabc" can be garbage collected and gone. – Alexei Levenkov Apr 14 '12 at 05:59
  • If strings are simply present in the dictionary as keys, they will not be collected. I don't understand your point. – Tigran Apr 14 '12 at 06:06
  • Because strings don't have to be used as keys in this case: the object I've described can represent a very long string in a constant amount of memory and be easily used as a key. – Alexei Levenkov Apr 14 '12 at 06:09
  • I understand that, and probably that would be my choice too. I just didn't understand why you began talking about `GC` in *this* case. – Tigran Apr 14 '12 at 06:11
  • An interesting solution, although I question how well it's going to handle the thousands of transactions per second that the OP stated. Every dictionary lookup is going to require at least one I/O. And since he says that 99% of strings will not be duplicates, those I/O operations will be predominantly writes. The file will grow without bound (as will your in-memory index), unless you come up with some way to do garbage collection (i.e. discard expired strings). – Jim Mischel Apr 15 '12 at 05:05

a unique hash of the string would be fine, providing there is no false positives due to collisions

That's not possible if you want the hash code to be shorter than the strings.

Using hash codes implies that there will be false positives; the goal is only that they are rare enough not to be a performance problem.

I would even consider creating the hash code from only part of the string, to make it faster. Even if that means you get more false positives, it could increase the overall performance.
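A hypothetical sketch of that idea: hash only a bounded prefix, so the hashing cost stops growing with string length, at the price of extra false positives for strings that share a long prefix:

static int PartialHash(string s)
{
    const int prefixLength = 128; // hypothetical cutoff

    if (s.Length <= prefixLength)
        return s.GetHashCode();

    // Fold the total length in so strings with the same prefix but
    // different lengths still tend to hash differently.
    return s.Substring(0, prefixLength).GetHashCode() ^ s.Length;
}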

Guffa

Provided the memory footprint is tolerable, I would suggest a HashSet<string> for the strings, and a queue to store a Tuple<DateTime, String>. Something like:

HashSet<string> Strings = new HashSet<string>();
Queue<Tuple<DateTime, String>> Expirations = new Queue<Tuple<DateTime, String>>();

Now, when a string comes in:

if (Strings.Add(s))
{
    // string is new. process it.
    // and add it to the expiration queue
    Expirations.Enqueue(new Tuple<DateTime, String>(DateTime.Now + ExpireTime, s));
}

And, somewhere you'll have to check for the expirations. Perhaps every time you get a new string, you do this:

while (Expirations.Count > 0 && Expirations.Peek().Item1 < DateTime.Now)
{
    var e = Expirations.Dequeue();
    Strings.Remove(e.Item2);
}

It'd be hard to beat the performance of HashSet here. Granted, you're storing the strings, but that's going to be the only way to guarantee no false positives.

You might also consider using a time stamp other than DateTime.Now. What I typically do is start a Stopwatch when the program starts, and then use the ElapsedMilliseconds value. That avoids potential problems that occur during Daylight Saving Time changes, when the system automatically updates the clock (using NTP), or when the user changes the date/time.
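A minimal sketch of that timing approach (the class and property names are placeholders):

using System.Diagnostics;

static class MonotonicClock
{
    // Started once at program start; unaffected by DST changes, NTP
    // adjustments, or the user changing the system date/time.
    private static readonly Stopwatch Clock = Stopwatch.StartNew();

    public static long NowMillis
    {
        get { return Clock.ElapsedMilliseconds; }
    }
}

You would then enqueue MonotonicClock.NowMillis + expireMillis instead of a DateTime, and compare against MonotonicClock.NowMillis when expiring entries.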

Whether the above solution works for you is going to depend on whether you can stand the memory hit of storing the strings.

Added after "Additional information" was posted:

If this will be accessed by multiple threads, I'd suggest using ConcurrentDictionary rather than HashSet, and BlockingCollection rather than Queue. Or, you could use lock to synchronize access to the non-concurrent data structures.

If it's true that 99% of the strings will not be duplicates, then you'll almost certainly need an expiration queue that can remove things from the dictionary.
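A minimal thread-safe sketch along those lines, assuming one fixed expiration window for all strings (it uses a ConcurrentQueue rather than a BlockingCollection, since nothing here needs to block):

using System;
using System.Collections.Concurrent;

class ConcurrentDuplicateFilter
{
    private readonly ConcurrentDictionary<string, byte> _seen =
        new ConcurrentDictionary<string, byte>();
    private readonly ConcurrentQueue<Tuple<DateTime, string>> _expirations =
        new ConcurrentQueue<Tuple<DateTime, string>>();
    private readonly TimeSpan _window;

    public ConcurrentDuplicateFilter(TimeSpan window)
    {
        _window = window;
    }

    // Returns true if the caller should process this string.
    public bool ShouldProcess(string s)
    {
        // Evict expired entries first. Small race: a concurrent caller may
        // evict an entry slightly early, which only lets an occasional
        // duplicate through.
        Tuple<DateTime, string> head;
        while (_expirations.TryPeek(out head) && head.Item1 < DateTime.UtcNow)
        {
            if (_expirations.TryDequeue(out head))
            {
                byte ignored;
                _seen.TryRemove(head.Item2, out ignored);
            }
        }

        if (!_seen.TryAdd(s, 0))
            return false; // duplicate within the window: skip

        _expirations.Enqueue(Tuple.Create(DateTime.UtcNow + _window, s));
        return true;
    }
}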

Jim Mischel
  • Why would you use the queue instead of a dictionary? – Jonathan Holland Apr 14 '12 at 05:22
  • This strikes me as overkill. A `Dictionary` would be a far simpler approach. A single data structure and a single `if` boolean expression would be all that's needed to be the gatekeeper. – Anthony Pegram Apr 14 '12 at 05:22
  • @JonathanHolland: because you want the expirations in time order. If you use the dictionary, then you have to iterate over the entire dictionary to find the items that need to expire. With the queue, you always know that the next item to expire is at the head of the queue. Assuming, of course, that all strings have the same expiration times (i.e. always expire a string after 5 minutes or whatever). If different strings have different expiration periods, you'll have to use a priority queue. – Jim Mischel Apr 14 '12 at 05:24
  • if (state.ContainsKey(input) && DateTime.UtcNow > state[input]) { process(input); } That doesn't require a linear scan. – Jonathan Holland Apr 14 '12 at 05:27
  • @AnthonyPegram: Yes, you could use just the dictionary. But you could end up collecting a bunch of strings that were only used once, and over time they would increase your memory footprint and potentially crash the program. Using the queue lets you *remove* strings from the dictionary. – Jim Mischel Apr 14 '12 at 05:27
  • @Jim, there's no need to iterate. The expiration doesn't need to be "after such and such time, kick the string out." It's more of "if the string exists but it has been so long, allow it anyway" (meaning you perform the op and update the timestamp). Where's the iteration? – Anthony Pegram Apr 14 '12 at 05:27
  • @JonathanHolland: That will work, but you'll end up holding on to strings that are never re-used. – Jim Mischel Apr 14 '12 at 05:28
  • @Anthony: As I said, you can keep the string at the risk of your dictionary growing without bound. The queue allows you to remove strings that haven't been seen after a period of time, which will prevent unbounded memory usage. – Jim Mischel Apr 14 '12 at 05:30
  • Fair enough, given the scenario has to handle exceptional duplicates, using the queue allows for a reduced memory footprint. – Jonathan Holland Apr 14 '12 at 05:31

If the memory footprint of storing whole strings is not acceptable, you have only two choices:

1) Store only hashes of the strings, which implies the possibility of hash collisions (whenever the hash is shorter than the strings). A good hash function (MD5, SHA-1, etc.) makes such collisions nearly impossible in practice, so it only depends on whether it is fast enough for your purpose (sketched below).

2) Use some kind of lossless compression. Strings usually compress well (often to around 10% of the original size), and some algorithms such as ZIP let you choose between fast (less efficient) and slow (high compression ratio) modes. Another way to compress strings is to convert them to UTF-8, which is fast and easy to do and cuts the size roughly in half for ASCII-only strings, since .NET strings are UTF-16 internally (also sketched below).
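As a rough sketch of option 1, using SHA-1 from System.Security.Cryptography (a collision is theoretically possible, but not a practical concern here):

using System;
using System.Security.Cryptography;
using System.Text;

static string HashKey(string s)
{
    using (var sha1 = SHA1.Create())
    {
        byte[] digest = sha1.ComputeHash(Encoding.UTF8.GetBytes(s));
        // The 20-byte digest becomes a 28-character key, regardless of input size.
        return Convert.ToBase64String(digest);
    }
}

And the UTF-8 variant of option 2 is a one-liner each way:

byte[] compact = Encoding.UTF8.GetBytes(s);          // roughly half the size for ASCII text
string original = Encoding.UTF8.GetString(compact);  // lossless round trip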

Whichever way you choose, it's always a tradeoff between memory footprint and hashing/compression speed. You will probably need to do some benchmarking to choose the best solution.

Ňuf