Can someone provide a real-world example of how data would be structured within a Bigtable? Please talk from a search engine, social networking or any other familiar point of view which illustrates clearly and pragmatically how the row -> column family -> column combo is superior to traditional normalized relational approaches.
2 Answers
Reading the original Google white paper was helpful:
As was this comprehensive list of information sources on Google data architecture:
http://highscalability.com/google-architecture
Update: 11/4/14
A new version of the Google white paper PDF can be found here:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf

- 90,123
- 14
- 117
- 115

- 941
- 2
- 11
- 25
I believe the difference is more about the way the data are queried rather the way they are stored.
The main difference between relational databases and NoSQL
is that there is, um, no SQL
in the latter.
This means you (not the query optimizer) write the query plans yourself.
This may increase the query performance if you know how to do that.
Consider a typical search engine query: find top 10
pages with all (or some) words included, say, "wet t-shirt contest", ordered by relevance (we're leaving word proximity aside for simplicity sake).
To do this, you need all words split and kept in a searchable and iterable list ordered by (word, relevance, source)
. Then you partition this list into (3 * ranks)
sets (each starting at the top of the words in your search query at a given rank), where ranks
is the possible number or ranks, say, 1
to 10
; and join the sets on source
, .
In a relational database it would look like this:
SELECT w1.source
FROM ranks r1
JOIN words w1
ON w1.word = 'wet'
AND w1.rank = r1.value
CROSS JOIN
ranks r2
JOIN words w2
ON w2.word = 'shirt'
AND w2.rank = r2.value
AND w2.source = w1.source
CROSS JOIN
ranks r3
JOIN words w3
ON w3.word = 'contest'
AND w3.rank = r2.value
AND w3.source = w1.source
ORDER BY
relevance_formula (w1.rank, w2.rank, w3.rank)
LIMIT 10
This is best executed using a MERGE JOIN
over the three sets partitioned by rank.
However, no optimizer I'm aware of will build this plan (leaving aside the fact that relevance_formula
may not distribute over the individual ranks).
To work around this, you should implement your own query plan: start at the top of each word/rank pair and just descend all three sets simultaneously, skipping the missing values and using search
rather then next
if you feel that there will be too much to skip in one of the sets.
Thus said, relational approach gives you a more convenient way to query data at cost of possible performance penalty.
If you are developing a campus web server, then writing those SELECT *
is OK even they are executed one microsecond longer than they possibly could be. But if you're developing a Google, it worth spending some time on optimizing the queries (which pure relational systems only allowing access to their data using SQL
just would not let to do).
The such called NoSQL
and relational databases sometimes diffuse into each other. For instance, Berkeley DB
is a well-known NoSQL
storage engine which was used by MySQL
as its storage backend to allow SQL
queries. And vice versa, HandlerSocket
allows pure key-value queries to a relational InnoDB
store with a MySQL
database built over it.

- 413,100
- 91
- 616
- 614
-
Altrough your post makes valid poitns, there is a big difference in how data is stored. HandlerSocket is exactly for skipping the sql layer of the RDBMS when all you want is to get row by it's index. You can use queries in document based datastore. The document-model stores, graph stores, key/value stores - each stores data differently in order to allow the diffenet - more effective - way to query the data. Ofter data is denormalized for performance purposes even in a rational database. – Maxim Krizhanovsky Jul 21 '11 at 11:33
-
@Darhazer: in different relational databases the data are stored differently: in `PostgreSQL` there are no clustered tables while in `InnoDB` there are not unclustered ones. There are many things I missed of course but if I tried to cover all things I would hit `30K post size * 30 answers per post` limit. – Quassnoi Jul 21 '11 at 11:45
-
Yeah but this difference is in the physical organisation of the data only, while the question is about data modeling. – Maxim Krizhanovsky Jul 21 '11 at 11:51
-
3How would the wet T-shirt data be organized in BigTable? – S. Valmont Jul 24 '11 at 00:16