How to sort items for faster insertion in the MapDB BTree?

Question

I have a list of around 20 million key-value pairs, and I'm storing the data in several MapDB databases in different ways, partly to see how it affects my program's performance and partly for the sake of experiment.

The thing is, it takes quite a lot of time to insert 20 million key-value pairs into a MapDB database in random order. So I would like to sort the list of key-value pairs so that I can insert them faster, and thus build databases out of them faster.

So, how would I go about this?

I'd like to learn how to do this for MapDB's BTreeSet and BTreeMap, in other words, for collections that hold a single value per key and for those that hold multiple values for a single key.

EDIT: I forgot to mention, the key-value pairs are String objects.

It's an interesting question. Most of the time spent inserting into a BTree is consumed by tree reorganization. So, given the premise that the final BTree will be effectively identical regardless of insertion order, is there an insertion order that reorganizes the tree less? (I know the tree will not be perfectly identical, but I think, especially with such a large data set, any differences are inconsequential.) In fact, an in-order insert may cause more BTree rebalancing than a random order, but I'm not sure. Curious to see if anyone has any insight. - Will Hartung, Aug 28 '14 at 16:01

score 2 · Accepted Answer · answered Sep 16 '14 at 09:48

Use the built-in Data Pump to create a new BTreeMap. Its running time is linear in the number of records, and it will sort the data even if it does not fit into memory.

Map newMap = db.createTreeMap("map")
    .pumpSource(randomIterator)  // source of the data to import
    .pumpBatchSize(1000000)      // data is sorted in batches; the batch size must be small enough to fit into memory
    .make();
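
If the pairs do fit into memory, the pre-sorting asked about in the question can also be done with the standard library alone. The following is only a minimal sketch, not part of MapDB: the class name PresortPairs and the method sortedCopy are made up for illustration, and no MapDB calls are used. It orders String pairs by key and then by value, so that multiple values for the same key end up next to each other. Whether the source should be ascending or descending is an assumption left to the caller; check the documentation of the MapDB version you use for the order its pump expects.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MapDB: pre-sorts String key-value pairs in memory.
class PresortPairs {

    // Returns a sorted copy of the pairs, ordered by key and then by value.
    // Ordering by value as a tie-breaker keeps all values of one key adjacent,
    // which covers the "multiple values for a single key" case.
    static List<Map.Entry<String, String>> sortedCopy(
            List<Map.Entry<String, String>> pairs, boolean descending) {
        Comparator<Map.Entry<String, String>> order =
                Map.Entry.<String, String>comparingByKey()
                        .thenComparing(Map.Entry.<String, String>comparingByValue());
        if (descending) {
            order = order.reversed();
        }
        List<Map.Entry<String, String>> copy = new ArrayList<>(pairs);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("b", "2"));
        pairs.add(new SimpleEntry<>("a", "2"));
        pairs.add(new SimpleEntry<>("a", "1"));

        // The iterator of the sorted copy could then serve as the import source.
        System.out.println(sortedCopy(pairs, false));
        System.out.println(sortedCopy(pairs, true));
    }
}
```

For 20 million String pairs this needs enough heap to hold the whole list, which is exactly the limitation the Data Pump's batch-wise sorting is meant to avoid.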
