22

I heard that B-Tree databases are faster than hash tables, so I thought of using a B-Tree database for my project. Is there an existing framework in Python that provides such a data structure, or will I have to code it from scratch?

Nikolay Fominyh
Rahul
  • 8
    This is a good occasion to avoid prematurely optimizing your application. Just get a working application, and then you can look for opportunities to improve performance if it is warranted. By the way, you can always try putting 'python b-tree' into Google for the answer to your question. – Adam Crossland Oct 11 '10 at 20:21
  • 1
    Well, I do have a prototype of my application, but the data sets I have to handle are close to a million entries, and conventional hashing cannot give me high enough speed, so I thought of venturing into B-Trees. – Rahul Oct 11 '10 at 20:48
  • What's up with all the downvotes? (i upvoted just to counter.) If you think this question and answers aren't up to par, please comment. – Paul Sasik Oct 11 '10 at 21:06
  • One million dict entries is nothing - I work with much larger datasets than that on a regular basis - and a hash table (like Python's dict type) will almost always be faster than a btree anyway. Again, benchmark your solution before trying to optimize it. – Kirk Strauser Oct 11 '10 at 21:52
  • If your program is slow, it won't be because of hash vs B-tree. The bottleneck will be somewhere else. With just a million entries, you could perhaps cache them all in RAM? – Gurgeh Mar 22 '12 at 09:49
  • 32
    I am so sick of this premature optimization argument. Make careful technology choices upfront - there is debt to be incurred from the wrong ones. I often hear "I'll go back and do that later" and yet I find code 6 years old that still has the same TODO in it. Don't let others keep you from finding out if list.indexOf(123) is reasonable in speed vs 123 in set() - It's not premature optimization to find out ;) – Ben DeMott May 08 '12 at 20:07
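
Several comments above suggest measuring before choosing a data structure. A minimal sketch of such a measurement with the standard `timeit` module follows. The dataset size, the `needle` value, and the helper name `time_membership` are illustrative assumptions, not something from the thread, and the absolute timings will vary by machine.

```python
import timeit

# Assumed dataset size, roughly matching the "closer to million" in the question.
N = 1_000_000
keys_list = list(range(N))
keys_set = set(keys_list)

# Worst case for the list: the value sits at the very end.
needle = N - 1


def time_membership(container, value, repeats=10):
    """Return the total seconds taken by `repeats` membership tests."""
    return timeit.timeit(lambda: value in container, number=repeats)


list_seconds = time_membership(keys_list, needle)
set_seconds = time_membership(keys_set, needle)

print(f"list: {list_seconds:.6f}s  set: {set_seconds:.6f}s")
```

The list test scans linearly, while the set test is a hash lookup, so the gap grows with `N`. The same pattern can be pointed at any candidate container before committing to one.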

6 Answers

32

The only reason to choose a B-Tree over a hash table, either in memory or with block storage (as in a database), is to support queries other than equality. A B-tree lets you perform range queries with good performance. Many key-value stores (such as Berkeley DB) don't make this externally visible, though, because they still hash the keys, but this still lets you iterate over the whole dataset quickly and stably (iterators remain valid even if there are adds or deletes, or the tree must be rebalanced).

If you don't need range queries, and you don't need concurrent iteration, then you don't need B-trees. Use a hash table; it will be faster at any scale.
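
To make the equality-versus-range distinction concrete, here is a minimal sketch using only the standard library. A sorted list searched with `bisect` stands in for an ordered index; it is not a B-tree (inserting into a list is O(n)), and the sample data and function names are made up for illustration.

```python
import bisect

# Hypothetical sample data: integer keys mapped to string values.
records = {k: f"value-{k}" for k in (42, 7, 19, 88, 3, 61)}


def range_scan_dict(table, low, high):
    """Range query on a hash table: every key has to be examined."""
    return sorted(k for k in table if low <= k <= high)


# Keys kept in sorted order, standing in for an ordered index.
sorted_keys = sorted(records)


def range_query_sorted(keys, low, high):
    """Range query on sorted keys: two binary searches, then a slice."""
    start = bisect.bisect_left(keys, low)
    stop = bisect.bisect_right(keys, high)
    return keys[start:stop]


# Equality lookup: the dict answers it directly.
print(records[19])
# Range lookup: the ordered structure only touches the matching keys.
print(range_query_sorted(sorted_keys, 10, 70))
```

Both functions return the same keys; the difference is that the dict version does work proportional to the whole table, while the sorted version does work proportional to the size of the result plus two logarithmic searches.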

Edit: I have since had a case where I actually needed these properties; for that, the blist package seems to be the most complete implementation of a sorted container library.

SingleNegationElimination
  • Berkeley DB certainly lets you do range queries with cursors. See http://docs.oracle.com/cd/E17076_02/html/gsg/CXX/Positioning.html – Gurgeh Mar 22 '12 at 09:44
  • 2
    The characterization about "the only reason to choose a B-Tree over a hash table, either in memory or with block storage ... is to support queries other than equal" is incorrect. In addition to range properties, b-trees provide efficient ordered traversal. This can be very important. – Christopher May 17 '13 at 23:07
  • 1
    "ordered traversal" is a concept closely related to range queries, and so I'm lumping them together in my answer. – SingleNegationElimination May 18 '13 at 00:23
  • There is one other big reason to use btrees - guaranteed worst case performance. Hash tables are only fast if no one is trying to DoS you via hash collisions. – Antimony Oct 10 '16 at 04:06
  • Sadly the project is dead. And Raymond did the arbitrary squash to keep a faster list out of Python. – Charles Merriam Dec 20 '19 at 20:47
  • I recommend being *very* careful when making claims such as "The only reason to choose a B-Tree over a hash table [is] ...". It only takes one counter-example to prompt a foot-mouth collision. – David J. Apr 07 '20 at 20:37
  • @CharlesMerriam Do you have a reference about this? I only watched a video long ago about improvements on `dict`, but did Raymond also improve the list implementation in the latest versions of Python? Does it now use a btree internally? A link would be great! :) – Basj Dec 27 '20 at 22:58
  • Such was long, long ago. Legends speak of a day before COVID when people would gather at places such as "PyBay" to talk of these things. Now, those days are lost in the mists of time. – Charles Merriam Dec 29 '20 at 22:07
4

You should really check out ZODB. http://www.zodb.org/en/latest/

I wrote a monograph about it long ago, though it's in Spanish: http://sourceforge.net/projects/banta/files/Labs/zodb/Monografia%20-%20ZODB.pdf/download

Information in English is all over the place.

Nande
3

Program what you are trying to do first, then optimize if needed. Period.

EDIT:

http://pypi.python.org/pypi/blist

Drop-in replacement for Python's built-in list.

user318904
  • Technically this is a part of my program. I don't want to use a conventional DB like MySQL, and I have been told to keep in mind that data will be inserted in large sets. – Rahul Oct 11 '10 at 20:44
  • So the constant lookup/access time offered by hash tables is not fast enough for what you are doing and you are looking towards b-trees to speed things up? I suggest reading about b-trees and hashes before asking questions about them. – user318904 Oct 11 '10 at 20:56
  • Well, I did a basic literature survey and came across this: http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/ The statistics mentioned there emboldened me to go for B-Trees; unfortunately there isn't a Python implementation of the program. – Rahul Oct 11 '10 at 21:24
  • The scaling considerations of Tokyo Cabinet do not apply to your project. Even if you need to scale out to hundreds of nodes, you will need to optimize for your own case. – SingleNegationElimination Oct 12 '10 at 07:23
  • ...and that article shows better times for hashes than b-trees in every instance. – user318904 Oct 12 '10 at 21:03
2

SQLite3 uses B+ Trees internally, but it sounds like you may want a key-value store. Try Berkeley DB for that. If you don't need transactions, try HDF5. If you want a distributed key-value store, there is also http://scalien.com/keyspace/, but that is a server-client type system which would open up all sorts of NoSQL key-value stores.

All of these systems will be O(log(n)) for insertion and retrieval, so they will probably be slower than the hash tables that you're currently using.
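
As a rough illustration of the SQLite option, the standard-library `sqlite3` module can serve as a small ordered key-value store. The table name `kv`, the helper names, and the sample data below are assumptions made for this sketch; an in-memory database is used so the example is self-contained.

```python
import sqlite3

# In-memory database for the example; pass a file path for on-disk storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key INTEGER PRIMARY KEY, value TEXT)")

# Bulk insert in a single transaction, which matters for large data sets.
conn.executemany(
    "INSERT INTO kv (key, value) VALUES (?, ?)",
    [(k, f"value-{k}") for k in range(1000)],
)
conn.commit()


def get(conn, key):
    """Equality lookup; returns None when the key is absent."""
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def key_range(conn, low, high):
    """Ordered range query over the primary key (inclusive bounds)."""
    cur = conn.execute(
        "SELECT key, value FROM kv WHERE key BETWEEN ? AND ? ORDER BY key",
        (low, high),
    )
    return cur.fetchall()


print(get(conn, 10))
print(key_range(conn, 3, 5))
```

The range query is the part a plain dict cannot do without scanning every key. Whether the extra lookup cost is acceptable is something to measure on the real workload.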

Kyoto Cabinet offers a hash database, so that may be more of what you're looking for, since it should be O(1) for insertion and retrieval, but you can't do in-order traversal if you need that (although since you're currently using hash tables, this shouldn't be an issue).

http://fallabs.com/kyotocabinet/

If you're looking for performance, you will need to have the speed-critical parts implemented in a compiled language and then have a wrapper API in Python.

Eric
2

You might want to have a look at mxBeeBase which is part of the eGenix mx Base Distribution. It includes a fast on-disk B+Tree implementation and provides storage classes which allow building on-disk dictionaries or databases in Python.

culebrón
2

Here is a good pure-Python B-tree implementation. You can adapt it if needed.

Carlo Pires