
I have a bunch of Strings I'd like a fast lookup for. Each String is 22 chars long and is looked up by its first 12 chars only (the "key", so to speak); the full set of Strings is recreated periodically: they are loaded from a file and refreshed whenever the file changes. I also have to deal with too little available memory - other server processes on my VPS need it too, and need it more.

How do I best store the Strings and search for them?

My current idea is to store them all one after another inside a char[] (to save RAM), and sort them for faster lookups (I figure the lookup is fastest if I have them presorted so I can use binary or interpolation search). But I'm not exactly sure how I should code it - if anyone is in the mood for a challenging puzzle: here it is...
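
To make it a bit more concrete, here is a rough, untested sketch of what I imagine (fixed-width entries concatenated into one char[], pre-sorted by their first 12 chars, looked up with a plain binary search) - the names and details are just placeholders:

```java
// Rough, untested sketch: 22-char entries concatenated back-to-back in one
// char[] and pre-sorted by their first 12 chars (the key).
static int find(char[] data, String key) {
    final int ENTRY_LEN = 22, KEY_LEN = 12;
    int lo = 0, hi = data.length / ENTRY_LEN - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;           // avoids int overflow for huge arrays
        int base = mid * ENTRY_LEN;
        int cmp = 0;
        for (int i = 0; i < KEY_LEN && cmp == 0; i++) {
            cmp = data[base + i] - key.charAt(i);
        }
        if (cmp < 0)      lo = mid + 1;
        else if (cmp > 0) hi = mid - 1;
        else return base;                    // char offset of the matching entry
    }
    return -1;                               // not found
}
```

The remaining 10 chars of a match would then sit at offsets `+12` to `+21` from the returned position.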

Btw: It's probably ok to exceed the memory constraints for a while during the recreation / sorting, but it shouldn't be by much or for long.

Thanks!

Update

For the "I want to know specifics" crowd (correct me if I'm wrong in the Java details): the source files contain about 320,000 entries (all ANSI text), I really want to stay (WAY!) below 64 MB of RAM usage, and the data is only part of my program. Here's some information on sizes of Java types in memory.

My VPS runs a 32-bit OS, so...

  • one byte[], all concatenated = 12 + length bytes
  • one char[], all concatenated = 12 + length * 2 bytes
  • String = 32 + length * 2 bytes (is Object, has char[] + 3 int)

So I have to keep in memory:

  • ~7 MB if all are stored in a byte[]
  • ~14 MB if all are stored in a char[]
  • ~25 MB if all are stored in a String[]
  • > 40 MB if they are stored in a HashTable / Map (for which I'd probably have to fine-tune the initial capacity)

A Hashtable is not magical - it helps on insertion, but in principle it's just a very long array of Strings where the hash code modulo the capacity is used as an index; the data is stored in the next free position after that index and searched linearly if it isn't found there on lookup. But for a Hashtable I'd need the String itself plus a substring of its first 12 chars as the lookup key. I don't want that (or am I missing something here?), sorry folks...
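
For completeness, the back-of-the-envelope arithmetic behind the numbers above (my own assumptions about 32-bit overheads, not measurements):

```java
// Rough size estimates for ~320,000 entries of 22 chars each (32-bit JVM assumed).
public class SizeEstimate {
    public static void main(String[] args) {
        long n = 320000;
        long oneByteArray = 12 + n * 22;               // ~7 MB  (one concatenated byte[])
        long oneCharArray = 12 + n * 22 * 2;           // ~14 MB (one concatenated char[])
        long stringArray  = n * (32 + 22 * 2) + n * 4; // ~25 MB (String objects + reference array)
        System.out.println(oneByteArray + " / " + oneCharArray + " / " + stringArray);
    }
}
```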

Arne
  • It would help if you asked one question (at a time) based on only one narrow problem you are facing, for example: memory usage, sort algorithm, data structure. – Aaron Kurtzhals Aug 10 '12 at 19:04
  • Anything wrong with a hashtable? Also, are you performance limited at all? – bcr Aug 10 '12 at 19:05
  • I'm not performance limited, but it still shouldn't take too long. HashTable doesn't really work for me (see update above). – Arne Aug 10 '12 at 20:15
  • @AaronKurtzhals: I don't think it would help as the decisions influence each other. But I hope the additional information concerning my constraints helps. – Arne Aug 10 '12 at 20:20
  • Your aversion to hash tables is hard to understand. The fact is that they are O(1) for both insertion and lookup unless the hash codes are degenerate. – user207421 Aug 11 '12 at 02:20
  • I don't have aversions to hash tables. I like them, I use them and I know a lot about them. I just dislike their memory consumption in this specific scenario - see above. – Arne Aug 12 '12 at 17:49

3 Answers


I would probably use a cache solution for that, maybe even Guava will do. Of course sort them, then do a binary search. Unfortunately I do not have the time for it :(

Eugene

Sounds like a HashTable would be the right implementation for this situation.

Searching is done in constant time and refreshing could be done in linear time.

Java Data Structure Big-O (Warning PDF)
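
A minimal sketch of that idea (the names and the initial capacity are only illustrative; the entries would come from your file):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: index the full 22-char entries by their first 12 chars.
static Map<String, String> buildIndex(List<String> entries) {
    // size the map up front so the periodic refresh doesn't trigger rehashing
    Map<String, String> index = new HashMap<String, String>(entries.size() * 4 / 3 + 1);
    for (String entry : entries) {
        index.put(entry.substring(0, 12), entry);
    }
    return index;
}
// lookup: index.get(key12) returns the full entry in expected O(1) time, or null
```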

eabraham

I coded a solution myself - but it's a little different from the question I posted, because I could use information I didn't publish (I'll do better next time, sorry).

I'm only answering this because it's solved. I won't accept one of the other answers because they didn't really help with the memory constraints (and were a little short for my taste). They still got an upvote each - no hard feelings, and thanks for taking the time!

I managed to pack all of the info into two longs (with the key residing completely in the first one). The first 12 chars are an ISIN, which can be compressed into a long because it uses only digits and capital letters, always starts with two capital letters and ends with a check digit that can be reconstructed from the other chars. The product of all possible values leaves a little more than 3 bits to spare.
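
A simplified sketch of the packing (my real code differs in details; the choice of base and digit order here is just illustrative):

```java
// Simplified sketch of packing the 12-char ISIN key into a long:
// two leading capital letters (26 values each), nine digits/capital letters
// (36 values each); the trailing check digit is dropped since it can be
// recomputed from the rest. 26^2 * 36^9 fits comfortably into a signed long.
static long packIsin(String isin) {
    long packed = 0;
    for (int i = 0; i < 2; i++) {                       // country code, base 26
        packed = packed * 26 + (isin.charAt(i) - 'A');
    }
    for (int i = 2; i < 11; i++) {                      // nine alphanumerics, base 36
        char c = isin.charAt(i);
        packed = packed * 36 + (c <= '9' ? c - '0' : c - 'A' + 10);
    }
    return packed;
}
```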

I store all entries from my source file in a long[] (packed ISIN first, the other stuff in the second long) and sort the pairs by the first of the two longs.

When I do a query by a key, I transform it to a long, do a binary search (which I'll maybe change to an interpolation search) and return the matching index. The different parts of the value are retrievable by said index - I get the second long from the array, unpack it and return the requested data.
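
The lookup is then nothing more than a binary search over the key positions, roughly like this (simplified; using -1 as the "not found" marker is just for illustration):

```java
// Sketch of the lookup over the interleaved layout: data[2*i] holds the packed
// key, data[2*i + 1] the packed payload, and the pairs are sorted by key.
static long lookup(long[] data, long packedKey) {
    int lo = 0, hi = data.length / 2 - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        long key = data[2 * mid];
        if (key < packedKey)      lo = mid + 1;
        else if (key > packedKey) hi = mid - 1;
        else return data[2 * mid + 1];       // the payload sits right next to its key
    }
    return -1;                               // not found
}
```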

The result: RAM usage dropped from ~110 MB to < 50 MB including Jetty (btw - I used a HashTable before) and lookups are lightning fast.

Arne
  • This is probably optimal, but using two `long[]` would be much easier as you could use existing sort and binarySearch (at a cost of an L3 cache-miss per lookup). – maaartinus Oct 28 '13 at 22:07
  • No, it wouldn't. Both longs have to be next to each other (so I get all data with `index` and `index + 1`). If I split it into two arrays and use existing `sort` and `binarySearch`, they are sorted independently and the association is lost. – Arne Oct 30 '13 at 08:02
  • You're right. Concerning the sort, you'd need a temporary `HashMap` to create the second `long[]`, which would temporarily cost a lot of memory. Or your own sort working on the two arrays in parallel, but that is what I wanted to avoid. – maaartinus Oct 30 '13 at 10:34