3

I have a large data set of around 200,000 values, all of which are strings. Which data structure should I use so that searching and retrieval are fast? Insertion is one-time, so even if insertion is slow it wouldn't matter much.

A hash map could be one solution, but what are the other choices? Thanks

Edit: some pointers:
1. I am looking for exact matches, not partial ones.
2. I have to accomplish this in PHP.
3. Is there any way I can keep this amount of data in cache, in the form of a tree or some other format?

Ed Staub
Elvis
  • Can you be precise about how you need to search and retrieve? Raze2dust's answer assumed a particular meaning for "search", I think. Do you just need to look up exact matches? Or do you need to "find closest"? – Ed Staub Sep 14 '11 at 18:27
  • @Ed: I am looking for exact matches and not the partial ones – Elvis Sep 15 '11 at 04:43

5 Answers

1

You really should consider not using maps or hash dictionaries if all you need is a string lookup. When using those, your complexity guarantees for N items and a lookup string of size M are O(M × log N) or, best amortised for the hash, O(M) with a large constant multiplier. It is much more efficient to use an acyclic deterministic finite automaton (ADFA) for basic lookups, or a trie if you need to associate data. These walk the data structure one character at a time, giving O(M) complexity with a very small multiplier.

Basically, you want a data structure that parses your string as it is consumed by the data structure, not one that must do a full string compare at each node of the lookup. The common orders of complexity you see thrown around for red-black trees and the like assume O(1) compares, which is not true for strings. Strings are O(M), and that propagates to all compares used.
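The character-at-a-time lookup described above can be sketched with a minimal trie. This is an illustrative sketch in Python (the thread's PHP/Java context notwithstanding); the class and field names are invented for the example, not any particular library's API:

```python
# Minimal trie: lookup walks one character at a time, so a membership
# test is O(M) in the key length, with no full string comparison at
# interior nodes.

class TrieNode:
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.terminal = False  # True if a stored key ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def contains(self, key):
        node = self.root
        for ch in key:                    # one hop per character
            node = node.children.get(ch)
            if node is None:
                return False
        return node.terminal              # reject bare prefixes

t = Trie()
for word in ["car", "cart", "cat"]:
    t.insert(word)

print(t.contains("cart"))  # True
print(t.contains("ca"))    # False: only a prefix, no key ends here
```

An ADFA/DAWG goes further by merging identical suffix subtrees, which shrinks memory but complicates insertion; the lookup walk is the same shape as above.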

ex0du5
  • How does a hash dictionary have O(log N) lookup times? – Gabe Sep 15 '11 at 12:28
  • I stated not to use either maps or hash dictionaries, and gave the former's complexity and the latter's amortised complexity. – ex0du5 Sep 15 '11 at 14:46
1

Maybe a trie data structure.

A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings.

Juraj Blaho
0

Use a TreeMap in that case. Search and retrieval will be O(log n). With a HashMap, retrieval is O(1) on average, but can degrade to O(n) in the worst case if many keys collide.

For 200,000 values it probably won't matter much, though, unless you are working under hardware constraints. I have used HashMaps with 2 million strings and they were still fast enough. YMMV.
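Both lookup strategies in this answer can be sketched in Python (used here only as a neutral illustration: a dict plays the role of HashMap, and binary search over a sorted list stands in for TreeMap's O(log n) ordered lookup):

```python
import bisect

words = ["banana", "apple", "cherry", "date"]

# Hash-based lookup (HashMap analogue): O(1) average per lookup.
index = {w: i for i, w in enumerate(words)}
print("cherry" in index)          # True

# Ordered lookup (TreeMap analogue): O(log n) string comparisons
# per lookup via binary search on a sorted list.
sorted_words = sorted(words)

def contains_sorted(key):
    i = bisect.bisect_left(sorted_words, key)
    return i < len(sorted_words) and sorted_words[i] == key

print(contains_sorted("cherry"))  # True
print(contains_sorted("grape"))   # False
```

Note that each of those O(log n) comparisons is itself a string compare, which is the point ex0du5's answer makes about hidden O(M) factors.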

Hari Menon
0

You can use B+ trees if you want to minimise search time at the cost of insertion time.

You can also try bucket push and search.

MduSenthil
0

Use a hashmap. Assuming an implementation similar to Java's and a normal collision rate, retrieval is O(m), where m is the key length: the main cost is computing the hashcode, followed by one string compare. That's hard to beat.

For any tree/trie implementation, factor in the hard-to-quantify cost of the additional pipeline stalls caused by extra non-localized data fetches. The only reason to use one (a trie in particular) would be to possibly save memory. Memory will be saved only with long strings; with short strings, the savings from reduced character storage are more than offset by all the additional pointers/indices.

Fine print: worse behavior can occur when there are lots of hashcode collisions due to an ill-chosen hashing function. Your mileage may vary. But it probably won't.

I don't do PHP - there may be language characteristics that skew the answer here.
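The cost model this answer describes (hash the key once, then typically a single full string compare) can be illustrated with a toy separate-chaining table. This Python sketch is invented for illustration and is not Java's or PHP's actual implementation:

```python
class ChainedHashTable:
    """Toy separate-chaining hash table. A lookup costs one hash
    computation over the key (O(m) in key length) plus, with a normal
    collision rate, usually a single full string compare."""

    def __init__(self, n_buckets=256):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # string compare on the chain
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:              # short chain => ~one compare
            if k == key:
                return v
        return None

table = ChainedHashTable()
table.insert("alpha", 1)
table.insert("beta", 2)
print(table.get("beta"))   # 2
print(table.get("gamma"))  # None
```

The "fine print" above corresponds to long chains here: an ill-chosen hash function piles keys into few buckets, and each lookup then walks and compares the whole chain.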

Ed Staub
  • Have you seen cache incoherency issues in practice? As the data set grows larger, collisions become potentially costlier, and collisions must walk nonlocal lists. Acyclic graphs (ADFAs, a.k.a. DAWGs), on the other hand, may be implemented so that leaf structures are arranged in cache coherent sets, so you usually have on average 3 or 4 nonlocal fetches (with processor predictive branching grabbing the first 1 or 2). My experience is that an ADFA is quite a bit faster _on average_ than a hash, and orders of magnitude faster worst case for data sets around the size mentioned or larger. – ex0du5 Sep 15 '11 at 22:56
  • Thanks. No, I haven't, but the OP said ("Insertion is one time") that this is read-only (and didn't say that it's multi-threaded, for that matter, but let's assume), so I don't see how cache incoherency would apply - can you explain? If it _is_ read-only, does it change your recommendation? Also, I can't picture the "3 or 4 non-local fetches" for 200,000 strings - how many nodes are you picturing fitting into a fetch? Are you really thinking about a PHP implementation that's cost-effective and feasible (in dev time) for the OP? – Ed Staub Sep 15 '11 at 23:27
  • I guess I meant stale cache misses requiring memory bus access in general, not just staleness requiring consistency restoration. Sorry for the bad question. I know a naive implementation just allocates nodes and may walk a new cache page every character, but I rarely see that being the case even with that naive code. Often, insert once tends to arrange leaf groups in the same page just by the system allocator, so once you're on the 4th or 5th node deep, you stay local. Multi-insert also tends to stay local by insert pattern matching search. – ex0du5 Sep 15 '11 at 23:43
  • On the other hand, I've typically seen "good" hash algorithms with good entropy, low collision, etc. taking 50 to 100 processor ticks. Collisions cause nonlocal access typically, walking the collision list. My experience is that using ADFAs typically take 1/5th the time, with spikes in the hash causing the ADFA to be 1/100th the time or better (using written language dictionaries and linguistic grammar trees - my only familiarity). – ex0du5 Sep 15 '11 at 23:49
  • Thanks again. It's a little hard to believe the 5x/100x number... but you've obviously worked this field a lot more than me! – Ed Staub Sep 16 '11 at 02:03