I am about to index 10 million titles along with their IDs (for now, their line numbers). The titles will be stored after tokenising them. The data structure has to be something like `<String, ArrayList<Integer>>`, where the Strings represent the tokens and the Integers represent the line numbers.

I have to build this tool using Java and persistent memory, avoiding an RDBMS if possible. Because this data structure is mutable, I couldn't find any tool that supports multimaps with this structure indexed by a B-tree or any other persistent data structure.

I tried MapDB, but it turned out to accept only immutable values, which does not work in my case (`ArrayList`).
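To make the target shape concrete, here is a minimal in-memory sketch of that structure. It is only an illustration: the class name `TitleIndex`, the whitespace tokenising, and the lower-casing are assumptions, and nothing here is persistent.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class TitleIndex {

    /**
     * Builds token -> line numbers from a list of titles.
     * Line numbers are 1-based; tokenising on whitespace is an assumption.
     */
    public static Map<String, ArrayList<Integer>> build(List<String> titles) {
        Map<String, ArrayList<Integer>> index = new HashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            int lineNumber = i + 1;
            String[] tokens = titles.get(i).toLowerCase(Locale.ROOT).split("\\s+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // leading whitespace yields an empty first token
                }
                ArrayList<Integer> lines =
                        index.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each line once, even if the token repeats within the title.
                if (lines.isEmpty() || lines.get(lines.size() - 1) != lineNumber) {
                    lines.add(lineNumber);
                }
            }
        }
        return index;
    }
}
```

The mutable `ArrayList` values are exactly the part that is awkward to persist, because every added line number changes an already stored value.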

Any thoughts are appreciated.

EurikaIam
  • What about [Guava Multimap](http://guava-libraries.googlecode.com/svn/tags/release03/javadoc/com/google/common/collect/Multimap.html)? – Luiggi Mendoza Feb 28 '13 at 14:23
  • Guava Multimap seems to be in-memory storage. – EurikaIam Feb 28 '13 at 14:25
  • "persistent memory" - wait, so do you want the data to be in-memory, or do you want it to be persistent? (I.e. stored on a disk.) If in-memory, just use a `HashMap`. If on-disk, then a BTree is the right choice, but I doubt you're going to get a good library for that that's not an RDBMS. (Seeing as "something that writes BTrees to disk" is a good description of the guts of an RDBMS.) – millimoose Feb 28 '13 at 14:26
  • If you just want a lightweight persistent key-value data store, consider [`Kyoto Cabinet`](http://fallabs.com/kyotocabinet/) - you'd just have to handle the multimap functionality yourself by (de)serializing your data into the `String`s/`byte[]`s it handles. This might be slow to create if you can't cluster updates to a given key, but reasonably fast to read which is arguably the point of indexing. – millimoose Feb 28 '13 at 14:29
  • So you need to index the data and then store the indexed data in some random access persistent storage (used for searches for example)? – RudolphEst Feb 28 '13 at 14:52
  • @RudolphEst, As the index has to be persistent (on hard disk), it is preferable to access data sequentially, not randomly. – EurikaIam Feb 28 '13 at 15:24
  • @EurikaIam I am not sure what use such an index would be ... but OK. Next question is what do you mean by mutable? Do you mean your program is continuously changing the entire index or changing values at an index (which is by definition random access)? I do not see why you cannot use any of the document oriented or object oriented DBs out there (MapDB, MongoDB, CouchDB for key->value storage) if you don't want to write your own B-Tree persister or use an RDBMS. – RudolphEst Feb 28 '13 at 15:51
  • @Ingo using JavaDB would require SQL, foreign keys and a lot of other RDBMS and ORM fidgeting, which is what I assume the poster would rather avoid. – RudolphEst Feb 28 '13 at 16:00
  • @RudolphEst If this is so then I would write this thing in Perl using Berkeley DB. -- Aren't those requirements funny? It's like: I want to climb Mount Everest, but without oxygen (RDBMS) and, of course, in sandals (Java). – Ingo Feb 28 '13 at 16:05
  • @RudolphEst have a look at this [link](https://groups.google.com/forum/?fromgroups=#!topic/mapdb/VYXt1KJn3N0). There is more info there about my case. – EurikaIam Feb 28 '13 at 16:15
  • @Ingo Which is why the poster suggested he wanted to use MapDB (which does pretty much the same thing). Since there are Java bindings for Berkeley, that should work too. I don't know if I would call RDBMS oxygen though, nor Java sandals... I personally cannot stand all the switching between relational and object oriented data structures. (Object trees do not store transparently in rows and columns) – RudolphEst Feb 28 '13 at 16:17
  • @EurikaIam I think I understand your problem now, and I am sure that MapDB is the embedded solution that you need (as Jan also replied in the Google Group link). Let me do a couple of tests on my side, and I will post some example code if it works. – RudolphEst Feb 28 '13 at 16:25
  • @EurikaIam I have been rather busy at work and at home, will try to get to it tomorrow, and post it here. I think I might not be able to return `ArrayList` when searching the map, but will definitely still return a `Collection`, which can either be copied into an `ArrayList` or (preferably) iterated over. – RudolphEst Mar 02 '13 at 23:34

1 Answer

What you need is called a multimap. MapDB does not support those directly, but it has composite sets, which are almost as good.

An example is here: https://github.com/jankotek/MapDB/blob/release-1.0/src/test/java/examples/MultiMap.java
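The composite-set idea can also be sketched without MapDB itself. The following is a minimal illustration, assuming an in-memory `java.util.TreeSet` as a stand-in for the persistent sorted set that MapDB would provide; the names `CompositeSetMultiMap`, `Entry`, `put`, and `get` are hypothetical and are not part of MapDB's API. Each (token, line number) pair is stored as its own immutable element, so no mutable `ArrayList` value is ever stored, and the line numbers for a token come back from a single range scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class CompositeSetMultiMap {

    /** Immutable (token, line) pair, ordered by token first, then by line. */
    static final class Entry implements Comparable<Entry> {
        final String token;
        final int line;

        Entry(String token, int line) {
            this.token = token;
            this.line = line;
        }

        @Override
        public int compareTo(Entry other) {
            int byToken = token.compareTo(other.token);
            return byToken != 0 ? byToken : Integer.compare(line, other.line);
        }
    }

    // In-memory stand-in (assumption) for a persistent sorted set.
    private final NavigableSet<Entry> entries = new TreeSet<>();

    /** Adds one pair; a duplicate (token, line) pair is ignored by the set. */
    public void put(String token, int line) {
        entries.add(new Entry(token, line));
    }

    /** Returns all line numbers stored for the token, in ascending order. */
    public List<Integer> get(String token) {
        Entry lowest = new Entry(token, Integer.MIN_VALUE);
        Entry highest = new Entry(token, Integer.MAX_VALUE);
        List<Integer> lines = new ArrayList<>();
        // All entries for one token are adjacent in the sort order.
        for (Entry e : entries.subSet(lowest, true, highest, true)) {
            lines.add(e.line);
        }
        return lines;
    }
}
```

Because the sort order groups entries by token, a lookup touches only the contiguous range for that token, which is the same property a B-tree-backed set gives on disk.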

fractaloop
Jan Kotek
  • Hi Jan. I tried what you suggested. The only issue is the size of the resulting index: 591.1 MB for 19,177,268 tokens with their IDs. That is just 10% of all the tokens; the rest have not yet been added to the index. I used NavigableSet> map1 = db.getTreeSet("test"); Do you think the size of the index can be reduced in any way? Or is this the nature of serialisation in Java? – EurikaIam Mar 06 '13 at 12:20
  • Make sure you call `db.compact()` to defragment storage. Also, we are planning to implement delta packing for tuples, which will dramatically reduce index size (it will be implemented soon). – Jan Kotek Mar 08 '13 at 23:43
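
The thread does not say how delta packing would work inside MapDB. As a rough illustration of the general idea only, and not MapDB's implementation, the sketch below stores ascending line numbers as gaps between neighbours and writes each gap as a variable-length integer. The class name `DeltaPacking` and both method names are hypothetical, and the input is assumed to be sorted and non-negative.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class DeltaPacking {

    /** Encodes ascending, non-negative IDs as variable-length gaps. */
    public static byte[] encode(int[] sortedIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedIds) {
            int gap = id - previous; // small when neighbouring IDs are close
            previous = id;
            // 7 payload bits per byte; the high bit means "more bytes follow".
            while ((gap & ~0x7F) != 0) {
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): reads each gap and adds it to a running total. */
    public static int[] decode(byte[] packed) {
        List<Integer> ids = new ArrayList<>();
        int previous = 0;
        int pos = 0;
        while (pos < packed.length) {
            int gap = 0;
            int shift = 0;
            int b;
            do {
                b = packed[pos++] & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;
            ids.add(previous);
        }
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ids.get(i);
        }
        return result;
    }
}
```

How much space this saves depends entirely on how close together the line numbers for a token are; widely scattered IDs produce large gaps that still need several bytes each.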