
I'm looking for a data structure, or a combination of data structures, that performs well on both random and sequential access.

I need to map an (integer) id to a (double) value and sort by that value. The values can occur multiple times.

The amount of data can be large.

Insertion and deletion are not critical; iteration and get operations are.

I'm using Java. Currently I have a Guava Multimap (built from a TreeMap and an ArrayList) for sequential access, and a HashMap in parallel for random access.

Any suggestions?

thertweck
  • Are your IDs in a certain range or can they theoretically be any arbitrary integer? – isnot2bad Nov 11 '13 at 11:38
  • They can be any arbitrary integer or even longs and there can be many of them. Nevertheless the range can be assumed to be known. – thertweck Nov 11 '13 at 11:42

1 Answer


When insertion and deletion are not critical, then a sorted array might be your friend. You could search it directly via Arrays.binarySearch and your custom Comparator.
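A minimal sketch of that idea (the `Entry` class and the sample data are made up for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical entry type pairing an id with its score.
final class Entry {
    final long id;
    final double value;
    Entry(long id, double value) { this.id = id; this.value = value; }
}

class SortedArrayDemo {
    // Order entries by value, the sort key from the question.
    static final Comparator<Entry> BY_VALUE =
            Comparator.comparingDouble(e -> e.value);

    public static void main(String[] args) {
        Entry[] entries = {
                new Entry(42, 1.5), new Entry(7, 0.5), new Entry(13, 0.5)
        };
        Arrays.sort(entries, BY_VALUE); // sort once, then iterate cheaply

        // Binary search by value; with duplicate values this returns
        // *some* matching index, not necessarily the first.
        int i = Arrays.binarySearch(entries, new Entry(-1, 0.5), BY_VALUE);
        System.out.println(i >= 0 ? "id " + entries[i].id : "not found");
    }
}
```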

In case you don't know any sane upper bound on the size, you can switch to an ArrayList (or implement your own resizing, but why...).

I guess this could be faster than the TreeMap, which is good when insertion and/or deletion are important, but suffers from bad spatial locality (a binary tree with many pointers to follow).

The optimal structure would place all the data in a single array, which is impossible in Java (you'd need a C struct for this). You could fake it by packing the doubles into longs; this is sure to work and to be fast (Double.doubleToLongBits and back are intrinsics, and both datatypes are 64 bits long). It would mean a non-trivial amount of work, especially for sorting (if sorting is uncommon enough, converting into some sortable array and back would do).
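A rough sketch of that packing trick (class and method names are mine; resizing and bounds checks omitted):

```java
// Ids and value bits interleaved in one long[]: entry i occupies
// slots 2*i (id) and 2*i+1 (value bits), so iteration is a single
// sequential scan over contiguous memory.
class PackedEntries {
    private final long[] data;

    PackedEntries(int capacity) {
        data = new long[2 * capacity];
    }

    void set(int i, long id, double value) {
        data[2 * i] = id;
        data[2 * i + 1] = Double.doubleToLongBits(value); // intrinsic, cheap
    }

    long idAt(int i)      { return data[2 * i]; }
    double valueAt(int i) { return Double.longBitsToDouble(data[2 * i + 1]); }
}
```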

In order to get faster search, you could use hashing, e.g., via a HashMap pointing to the first element and linking the elements. As your keys are ints, a primitive-capable implementation would help (e.g. Trove or fastutil or whatever).
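For instance, a primitive map from id to the entry's position in the value-sorted array could look like this (a sketch assuming Trove's `TIntIntHashMap`; fastutil's `Int2IntOpenHashMap` would be analogous):

```java
import gnu.trove.map.hash.TIntIntHashMap;

// Primitive-keyed index: id -> position in the value-sorted array,
// giving O(1) random access without boxing.
class HashedIndex {
    private final TIntIntHashMap slotById = new TIntIntHashMap();

    // Rebuild after every re-sort of the backing array.
    void rebuild(int[] ids) {
        slotById.clear();
        for (int i = 0; i < ids.length; i++) {
            slotById.put(ids[i], i);
        }
    }

    int slotOf(int id) { return slotById.get(id); }
}
```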

There are countless possibilities, but keeping all your data in sync can be hard.

maaartinus
  • @Dukeling: There's no such thing as an array of objects in Java. Something like `new Object[....]` is an array of references and you pay one indirection. – maaartinus Nov 11 '13 at 12:28
  • "..., which is impossible in Java (you'd need C struct for this)" - can you elaborate on this? What about an array of object **references**? Happy? (my question still stands) – Bernhard Barker Nov 11 '13 at 12:35
  • There's nothing wrong with an array of object references. It's just that in your case it may mean a full (L3) cache miss for following the references. Sequential access to the array elements switches on cache prefetching and you can work at full L1 speed. That means 2 cycles on most current CPUs, as opposed to 50-100 for a memory access. Am I happy now? – maaartinus Nov 11 '13 at 12:41
  • Thank you for your answer, the discussion, and the hint to Trove and fastutil. As memory is a concern and the structure has to be dumped to disk in some way, I think I might go with Trove's TLongArrayList, alternating the keys and values (a sketch of this layout follows these comments). A TLongIntHashMap would lend itself to indexing the keys. I would do the sorting offline in favor of fast online query times. Did I understand you right with this? As a memory-performance tradeoff I would only read part of the array into memory and access the rest from disk based on the hash index. – thertweck Nov 11 '13 at 13:07
  • @kruemel: If you're concerned about memory, then there may be a problem with the `TLongArrayList`. I guess it's limited to `Integer.MAX_VALUE` elements, which with `long`s means 16GB (I hope you have more memory). Disk is so much slower than memory... you can sort on disk rather efficiently, but random access is a real pain. How much memory do you need? – maaartinus Nov 11 '13 at 13:14
  • The limit of TLongArrayList is fortunately not a show stopper. I need many of those structures, around 60, each with certainly not more than 100,000 entries. That's not much. The intention is to reduce the memory footprint. An implementation with ArrayList and HashMap would maybe use twice the memory, wouldn't it? Background: these lists represent scores of the same objects. The idea is to be able to aggregate the top-k of them very fast. – thertweck Nov 11 '13 at 13:32
  • Something like 2-4 times as much. You save a lot of memory and speed by doing this, just don't forget that memory is cheap and (developer's) time is money. – maaartinus Nov 11 '13 at 13:57
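
For reference, the layout discussed in the comments above might look roughly like this (a sketch assuming Trove's `TLongArrayList` and `TLongIntHashMap`; the names and the append-only design are mine, not from the thread):

```java
import gnu.trove.list.array.TLongArrayList;
import gnu.trove.map.hash.TLongIntHashMap;

// Keys and value bits alternate in one primitive list; the hash map
// remembers each key's slot so random access skips the linear scan.
class ScoreList {
    private final TLongArrayList data = new TLongArrayList(); // [id0, bits0, id1, bits1, ...]
    private final TLongIntHashMap slotById = new TLongIntHashMap();

    void add(long id, double value) {
        slotById.put(id, data.size());
        data.add(id);
        data.add(Double.doubleToLongBits(value));
    }

    double valueOf(long id) {
        return Double.longBitsToDouble(data.get(slotById.get(id) + 1));
    }
}
```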