3

What is the best solution if I need a database with a billion+ objects and immediate (or nearly immediate) access to any item in it at any time?

This database would be queried at about 1000 requests per second. The rows in the database are pretty much unrelated, so it doesn't need to be relational.

If you're curious why, it's for a simulation of moving elements.

I was thinking of something like several load-balanced Cassandra clusters that are accessed through a load-balanced cluster of web servers.

Money is a factor, so the cheaper the better. There is no restriction on the software or tool, but it must be open source.

I'm just looking for a database solution that is good at handling a ridiculous amount of data (it does not need to be relational at all) accessed by a large number of users.

It is essential that it handles redundancy and failures.

Just a high-level idea to point me in the right direction would be great.

Jonathan Leffler
jreid42
  • On average, how many objects would the 1000 req/sec need to collect from your billions? How much correlation would there be between the data selected by the different requests? How big is each of the billion or more objects? How are the objects identified? – Jonathan Leffler Jul 21 '10 at 17:46
  • Say about 10 or 20 per request. There is no correlation (there is, but it will be calculated client side). Each object is really just about 10 plain-text attributes plus 3D positional data. The objects would be identified by a unique key, or, if the user was scoped to a specific location, they would need to be able to see all objects within X units (so there would need to be a way to query the db for only the results within a range in X, Y, and Z). The other attributes could be queried too, but that would only return about 10 to 20 objects, or 100 at most. – jreid42 Jul 21 '10 at 17:57
  • To clarify: you cannot say "give me all objects with attribute z == this". It would always be "give me all objects within 200 units of X,Y,Z", and then you could additionally filter by their attributes (but this could be done client side, as there wouldn't be that many in the same region). – jreid42 Jul 21 '10 at 17:59
  • Why do I cringe reading the words "billions of rows" and "the cheaper the better"? – HLGEM Jul 21 '10 at 18:28
  • 1
    I just mean that I want to do the right thing and not simply throw hardware (needlessly) at the problem. – jreid42 Jul 21 '10 at 19:44

2 Answers

1

One option to consider is mapping your 3D coordinates onto a space-filling curve, effectively representing a point as a single value. Then you could run Cassandra's range queries to get points in an area.

I've seen this implemented in 2D space before; I'm sure it's possible in 3D as well.
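As a rough illustration of the idea, here is a minimal Python sketch using a Z-order (Morton) curve, which is only one possible space-filling curve and not necessarily the one used in the 2D implementation mentioned above. The names `morton_encode`, `morton_decode`, and `box_query` are hypothetical, the coordinates are assumed to be non-negative integers that fit in 21 bits (real positions would first need to be quantized), and a sorted Python list stands in for an ordered key range in the database.

```python
import bisect

BITS = 21  # assumption: each coordinate fits in 21 bits (63-bit key)


def morton_encode(x, y, z):
    """Interleave the bits of x, y, z into a single integer key."""
    code = 0
    for i in range(BITS):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def morton_decode(code):
    """Recover (x, y, z) from a key produced by morton_encode."""
    x = y = z = 0
    for i in range(BITS):
        x |= ((code >> (3 * i)) & 1) << i
        y |= ((code >> (3 * i + 1)) & 1) << i
        z |= ((code >> (3 * i + 2)) & 1) << i
    return x, y, z


def box_query(sorted_codes, lo, hi):
    """Return the points inside the inclusive box [lo, hi].

    Every point in the box has a key between the keys of the two
    corners, so one range scan finds all candidates. The scan also
    picks up points outside the box, so each candidate is re-checked.
    """
    start = bisect.bisect_left(sorted_codes, morton_encode(*lo))
    end = bisect.bisect_right(sorted_codes, morton_encode(*hi))
    hits = []
    for code in sorted_codes[start:end]:
        point = morton_decode(code)
        if all(l <= c <= h for l, c, h in zip(lo, point, hi)):
            hits.append(point)
    return hits


points = [(1, 1, 1), (2, 3, 4), (10, 10, 10), (5, 0, 5)]
codes = sorted(morton_encode(*p) for p in points)
nearby = box_query(codes, (0, 0, 0), (5, 5, 5))
```

The single range scan can cover far more keys than the box actually contains, so a practical version would probably split the box into several smaller key ranges; that refinement is left out here.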

Andrew
0

Since you will need to be able to efficiently get all objects within a 3D interval (X_min <= X_obj <= X_max & Y_min <= Y_obj <= Y_max & Z_min <= Z_obj <= Z_max), I am not sure how well a key-value store like Cassandra will suit you. It may also be worthwhile to have a look at MongoDB, since I believe it allows you to index multiple fields and query based on intervals.
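To make the interval query concrete, here is a small Python sketch that builds a MongoDB-style filter document for the box around a point, using the standard `$gte` and `$lte` comparison operators. The field names `x`, `y`, `z` and the helpers `box_filter` and `matches` are assumptions for illustration only; `matches` just evaluates the same predicate in plain Python so the sketch runs without a server.

```python
def box_filter(center, radius):
    """Build a filter for the axis-aligned box around center.

    Assumes documents store their position in fields named
    "x", "y", and "z" (hypothetical schema).
    """
    cx, cy, cz = center
    return {
        "x": {"$gte": cx - radius, "$lte": cx + radius},
        "y": {"$gte": cy - radius, "$lte": cy + radius},
        "z": {"$gte": cz - radius, "$lte": cz + radius},
    }


def matches(doc, query):
    """Evaluate the range filter against one document in plain Python."""
    return all(
        bounds["$gte"] <= doc[field] <= bounds["$lte"]
        for field, bounds in query.items()
    )


query = box_filter((1000, 2000, 3000), 200)
inside = {"x": 1100, "y": 1900, "z": 3150, "name": "probe"}
outside = {"x": 1300, "y": 2000, "z": 3000, "name": "rock"}
in_box = matches(inside, query)      # True
out_of_box = matches(outside, query)  # False
```

With a driver such as pymongo, a filter like this would typically be passed to `collection.find(query)`. The box is a superset of the "within 200 units" sphere, so an exact distance check could still be applied to the results client side.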

Chris
  • I've heard that MongoDB isn't the greatest in terms of protecting your data. – jreid42 Jul 22 '10 at 22:44
  • MongoDB is as good as any other DB under good conditions. It does admit that hardware fails, and unless you have the data on two or three different machines, you can't be certain it's safe. – Alister Bulman Jul 25 '10 at 21:01
  • Cassandra also allows indexing multiple fields and querying based on intervals. – the paul Apr 26 '12 at 21:43