Java Berkeley DB read performance w/ 100M documents

Question

I'm wondering whether Berkeley DB JE is a suitable choice for storing simple key/value pairs for 100 million documents.

I need to fetch a single document from BDB in under 75 ms.

Thanks in advance

100 megabytes of documents, or 100 million documents? How big is a "document?" — Matt Ball, Apr 04 '11 at 15:55
If you can keep all the data in memory, you shouldn't have a problem. — Peter Lawrey, Apr 04 '11 at 15:55
100 million documents. The key is a string of at most 20 characters, and so is the value. Hardware: Intel i5, 6 GB RAM, 7.2k rpm SATA HDD. — Samuel García, Apr 04 '11 at 16:26
So you can store almost all the data in memory. I would expect you to get <<75 ms, possibly less than 1 ms, most of the time, depending on how random your data access is. If this doesn't perform well enough, I would suggest buying a server with more memory; e.g. you can buy a 32 GB server for $3000 (less for a smaller one). — Peter Lawrey, Apr 04 '11 at 16:48
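
The "almost all the data in memory" estimate in the comments can be checked with quick arithmetic. The sketch below is only a back-of-envelope calculation: it assumes one byte per character and every key and value at the 20-character maximum, and it deliberately ignores Berkeley DB JE's own per-record and B-tree overhead, which will make the real footprint larger. The class and method names are made up for illustration.

```java
// Back-of-envelope estimate only; JE's per-record and index overhead is NOT included.
final class RawDataEstimate {

    /** Raw payload bytes, assuming 1 byte per character and maximum-length strings. */
    static long rawBytes(long recordCount, int maxKeyChars, int maxValueChars) {
        return recordCount * (maxKeyChars + maxValueChars);
    }

    public static void main(String[] args) {
        long raw = rawBytes(100_000_000L, 20, 20); // figures taken from the comments
        long ram = 6L * 1024 * 1024 * 1024;        // the 6 GB machine mentioned there
        System.out.printf("raw data: %.2f GiB, RAM: %.2f GiB%n",
                raw / (double) (1L << 30), ram / (double) (1L << 30));
    }
}
```

Under these assumptions the raw payload is 4,000,000,000 bytes, roughly 3.7 GiB, so it fits in 6 GB only if the database overhead plus the JVM and OS stay within the remaining couple of gigabytes.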

score 0 · Answer 1 · answered Apr 04 '11 at 16:18

Why not use Apache Lucene, an open-source information retrieval engine? I would use Lucene to keep an index from keywords to document IDs. You can then submit a keyword (or a set of keywords) to Lucene, get back a document ID, and retrieve the document from Berkeley DB.

Skarab


This approach is used to locate the server shard on a Solr cluster. As we can't know the current location of a given document without querying the whole cluster, we are experimenting with a complete shard/document index built on top of BDB. – Samuel García Apr 04 '11 at 16:29
OK. Could you add more information to your question so that it is easier to address your issue? In my experience (disclaimer: I have not worked on production systems, only on research/prototype development), it is not a good idea to make a database do the job of an IR engine. – Skarab Apr 04 '11 at 16:46

score 0 · Answer 2 · answered Apr 05 '11 at 04:33

You may want to discuss your performance requirements on the Berkeley DB Java Edition discussion forum. The main question is going to end up being "How many I/Os do you need to perform in order to get to the data?" If the answer is "none", then 75 ms response time is a piece of cake. If the answer is "many" then it will depend on how many "many" is and the speed of your disk drive.
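
To put a rough number on "how many is many", here is a minimal sketch of the I/O budget for a 7.2k rpm drive like the one mentioned in the comments. The average rotational latency follows from the spindle speed (half a revolution); the 9 ms average seek time is an assumed, typical-looking figure and not a measurement of any particular disk. Transfer time, queueing, and OS caching are ignored, and the names are hypothetical.

```java
// Rough I/O budget for a spinning disk; the seek time is an ASSUMED value.
final class IoBudget {

    /** Average rotational latency in ms: the time for half a revolution. */
    static double avgRotationalLatencyMs(int rpm) {
        return 60_000.0 / rpm / 2.0;
    }

    /** Number of random reads that fit in the budget for a given assumed seek time. */
    static int maxRandomReads(double budgetMs, int rpm, double assumedSeekMs) {
        double perReadMs = assumedSeekMs + avgRotationalLatencyMs(rpm);
        return (int) Math.floor(budgetMs / perReadMs);
    }

    public static void main(String[] args) {
        int reads = maxRandomReads(75.0, 7200, 9.0); // 9 ms seek is an assumption
        System.out.println("random reads within 75 ms: " + reads);
    }
}
```

With these assumptions each random read costs a little over 13 ms, so only about five of them fit into a 75 ms budget. That is why the number of I/Os per lookup matters far more than anything else here.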

There are some excellent quick references on the BDB JE FAQ page. In particular, this one may be of immediate use. Basically, you want to size your cache so at least all of the Index Nodes fit in memory. If the Index Nodes fit in memory, then you'll have to do at most one I/O to get to the data (Leaf Node) unless it's already in the cache.
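
For a feel of how many Index Nodes that is, the sketch below counts the internal nodes of a B-tree over 100 million records. It assumes a fan-out of 128 entries per node and that every node is completely full; both are simplifying assumptions, and real trees are usually less densely packed, so treat the result as a lower bound. It does not attempt to convert the count into bytes, since per-node memory depends on key size and JE version; the `DbCacheSize` utility shipped with JE is the tool intended for that estimate. The class and method names here are illustrative only.

```java
// Counts internal (index) nodes of a fully packed B-tree; fan-out is an ASSUMPTION.
final class IndexNodeCount {

    /** Integer division rounded up, for positive operands. */
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    /** Total internal nodes above the leaf records, from the bottom level up to the root. */
    static long internalNodes(long recordCount, int fanOut) {
        if (recordCount < 1 || fanOut < 2) {
            throw new IllegalArgumentException("need recordCount >= 1 and fanOut >= 2");
        }
        long total = 0;
        long level = recordCount;
        do {
            level = ceilDiv(level, fanOut); // nodes needed to point at the level below
            total += level;
        } while (level > 1);                // stop once a single root node remains
        return total;
    }

    public static void main(String[] args) {
        System.out.println("internal nodes: " + internalNodes(100_000_000L, 128));
    }
}
```

Nearly all of the internal nodes sit in the bottom level, directly above the records, so the cache has to be large enough for that level in particular if lookups are to stay at one I/O or fewer.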
