5

Scenario: suppose you have about 90TB of text in roughly 200 tables. This is structured, related data, comparable to DBpedia but with more of it. Any truly relational, distributed and performant database would do the job. Don't expect as many updates as a social network: roughly 500 read queries/s and 20 updates/s. But the main feature required besides that is running big analyses over the database at high speed, since the data is to be constantly reworked and improved with machine learning, e.g. Apache Mahout.

Now the first issue is which database technologies to start with (or to wait for until they are released) in order to maintain all that data with a relatively low number of web visitors but a high demand for fast analysis/machine learning. And second, which other databases should be kept on the radar for particular purposes that may occur, which should be dropped off the list, and which should be grouped into pairs of which only one (the better) would be applied?

Cloudera/Brisk (Cassandra, Hive)
MySQL (Cluster), MariaDB
Berkeley DB
Drizzle, NimbusDB
SciDB (http://www.theregister.co.uk/2010/09/13/michael_stonebraker_interview/)
MongoDB
DataDraw
Neo4j
Jonas
  • 90 TB? Your fingers must really hurt from typing all that text ;-) – Johan Apr 21 '11 at 08:20
    What kind of queries will you run on it? Start your question with this, please. – Vladislav Rastrusny Apr 21 '11 at 09:23
  • You might want to ask this on the [DBA](http://dba.stackexchange.com/) site instead. – Bill the Lizard Apr 21 '11 at 11:23
  • 200 tables is a massive amount of tables for a data warehouse, without knowing how EXACTLY the data is being used and transformed to get reports - the fastest solution will be some sort of Map/Reduce implementation (Hadoop + Cassandra comes to mind as one). You should expand your question because knowing how the data is transformed helps (if it's google-like where you store everything possible and then do queries based on patterns of text found then Map/Reduce platform beats anything else). – Michael J.V. Apr 21 '11 at 15:44

2 Answers

2

Sounds like a good fit for Cassandra + Hadoop. This is possible with a little effort today; DataStax (where I work) is introducing Brisk (also open source) to make it easier: http://www.datastax.com/products/brisk
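
To give a feel for what the analysis side looks like, here is a rough sketch of a Hadoop job that reads rows straight out of a Cassandra column family, modeled on the word_count example that ships with Cassandra. The keyspace/column family ("wiki"/"articles"), the contact node, and the output path are placeholders, and the ConfigHelper setter names have shifted slightly between Cassandra releases, so treat this as an outline rather than copy-paste code:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.thrift.SliceRange;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ArticleWordCount {

        // Each map() call gets one Cassandra row: its key plus the columns
        // selected by the slice predicate configured in main().
        public static class TokenMapper
                extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context ctx)
                    throws IOException, InterruptedException {
                for (IColumn column : columns.values()) {
                    for (String word : ByteBufferUtil.string(column.value()).split("\\s+")) {
                        ctx.write(new Text(word), ONE);
                    }
                }
            }
        }

        // Standard summing reducer.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "article-word-count");
            job.setJarByClass(ArticleWordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Read input rows from Cassandra instead of HDFS files.
            // Note: setter names vary a little by Cassandra release
            // (older versions use setInitialAddress/setRpcPort/setPartitioner).
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            Configuration conf = job.getConfiguration();
            ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");   // placeholder contact node
            ConfigHelper.setInputRpcPort(conf, "9160");
            ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
            ConfigHelper.setInputColumnFamily(conf, "wiki", "articles"); // placeholder keyspace/CF

            // Ask for all columns of each row.
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                                   ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
            ConfigHelper.setInputSlicePredicate(conf, predicate);

            FileOutputFormat.setOutputPath(job, new Path("/tmp/article_word_count")); // placeholder
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The operational point is that the same cluster serving the live reads/writes also runs the Hive/MapReduce jobs, so there is no separate ETL step copying 90TB into a warehouse before each analysis pass.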

jbellis
  • I was on your site before, sorry I did not mention it; feel free to point out advantages compared to Cloudera when applying Cassandra + Hadoop – Jonas Apr 22 '11 at 02:54
2

But main feature required besides those is running big analyses on the database in maximum speed

So now all you need is 90TB+ of RAM and you're set. "Maximum" speed is a very relative concept.
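
A quick back-of-the-envelope calculation shows why. Even a single sequential pass over the data is dominated by disk bandwidth; the figures below (100 MB/s per disk, 4 data disks per node) are assumptions for illustration, not benchmarks:

    public class ScanTimeEstimate {
        public static void main(String[] args) {
            // Assumed, purely illustrative figures.
            double dataTB = 90.0;
            double mbPerSecPerDisk = 100.0;
            int disksPerNode = 4;

            double dataMB = dataTB * 1024 * 1024;                // TB -> MB
            double singleDiskHours = dataMB / mbPerSecPerDisk / 3600;

            System.out.printf("One disk, full scan:  %.0f hours (~%.0f days)%n",
                    singleDiskHours, singleDiskHours / 24);
            for (int nodes : new int[] {10, 30, 90}) {
                double hours = singleDiskHours / (nodes * disksPerNode);
                System.out.printf("%d nodes x %d disks: ~%.1f hours per full scan%n",
                        nodes, disksPerNode, hours);
            }
        }
    }

In other words, "fast" analysis over 90TB is really a question of how many spindles and nodes you can throw at it in parallel, not of which database is nominally the fastest.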

I have got about 90TB of text in ~200 tables. This is structured, related data. Any true relational, distributed and performant database would do the job.

What is a "true relational distributed database"?

Let's flip this around. Let's say that you had 90 servers and they each held 1TB of data. What's your plan to perform joins amongst your 200 tables and 90 servers?

In general, cross-server joins don't scale very well. Trying to run joins across 90 servers is going to scale even worse. And partitioning 200 tables so that they can be joined locally is a lot of work.
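
To make the join problem concrete, here is a toy sketch of what an application-side equi-join across shards ends up looking like: one whole side of the join gets pulled over the network and held in memory, and every row of the other side crosses the network once. The Shard interface and the table/column names ("articles", "authors") are made up for illustration:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy model of an application-side equi-join across shards.
    // "Shard" stands in for whatever client you would use against each server.
    public class CrossShardJoin {

        interface Shard {
            // Fetch rows of a table living on this shard.
            List<Map<String, Object>> scan(String table);
        }

        // Join articles.author_id = authors.id across all shards.
        static List<Map<String, Object>> join(List<Shard> shards) {
            // Step 1: pull the "small" side from every shard and build a hash map.
            // With 90 servers that is already 90 network round trips, and the
            // whole build side has to fit in the coordinating client's memory.
            Map<Object, Map<String, Object>> authorsById = new HashMap<>();
            for (Shard shard : shards) {
                for (Map<String, Object> author : shard.scan("authors")) {
                    authorsById.put(author.get("id"), author);
                }
            }

            // Step 2: stream the "big" side shard by shard and probe the map.
            // Every row of the big table crosses the network once; the join can
            // only be pushed down if both tables are partitioned on the join key,
            // and you would have to arrange that for all 200 tables.
            List<Map<String, Object>> result = new ArrayList<>();
            for (Shard shard : shards) {
                for (Map<String, Object> article : shard.scan("articles")) {
                    Map<String, Object> author = authorsById.get(article.get("author_id"));
                    if (author != null) {
                        Map<String, Object> row = new HashMap<>(article);
                        row.put("author_name", author.get("name"));
                        result.add(row);
                    }
                }
            }
            return result;
        }
    }

And that is just one join between two tables; a reporting query touching ten of your 200 tables multiplies the problem, which is why Map/Reduce-style denormalization (as suggested in the comments and the other answer) tends to win at this scale.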

which other databases to keep track of generally in this context and which to drop off the list

OK, so there are lots of follow-up questions here:

  • What are you running right now?
  • What are your pain points?
  • Are you really planning to just drop in a new system?
  • Is there a smaller sub-system that can be tested on first?
  • If you have 200 tables, how many different queries are you running? Thousands?
  • How do you plan to test that queries are behaving correctly?
Gates VP