Bigtable Practical Example

Question

Can someone provide a real-world example of how data would be structured within a Bigtable? Please talk from a search engine, social networking or any other familiar point of view which illustrates clearly and pragmatically how the row -> column family -> column combo is superior to traditional normalized relational approaches.

score 9 · Accepted Answer · edited Nov 04 '14 at 15:29

Reading the original Google white paper was helpful:

http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/bigtable-osdi06.pdf

As was this comprehensive list of information sources on Google data architecture:

http://highscalability.com/google-architecture

Update: 11/4/14

A new version of the Google white paper PDF can be found here:

http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf

score 4 · Answer 2 · answered Jul 21 '11 at 11:20

I believe the difference is more about the way the data are queried rather than the way they are stored.

The main difference between relational databases and NoSQL is that there is, um, no SQL in the latter.

This means you (not the query optimizer) write the query plans yourself.

This may increase the query performance if you know how to do that.

Consider a typical search engine query: find the top 10 pages with all (or some) words included, say, "wet t-shirt contest", ordered by relevance (we're leaving word proximity aside for simplicity's sake).

To do this, you need all words split and kept in a searchable and iterable list ordered by (word, relevance, source). Then you partition this list into (3 * ranks) sets (each starting at the top of one of the words in your search query at a given rank), where ranks is the number of possible ranks, say, 1 to 10, and join the sets on source.
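As a rough illustration of such a list, here is a minimal Python sketch. The `INDEX` data, the page names, and the `sources_for` helper are all hypothetical, and it assumes that a lower rank means a more relevant hit. It keeps (word, rank, source) tuples sorted and uses a binary search to jump to the start of one (word, rank) partition:

```python
from bisect import bisect_left

# Hypothetical toy index: (word, rank, source) tuples kept in sorted order.
# Assumption for this sketch: a lower rank means a more relevant hit.
INDEX = sorted([
    ("contest", 1, "page_b"),
    ("contest", 2, "page_a"),
    ("shirt", 1, "page_a"),
    ("shirt", 1, "page_b"),
    ("wet", 1, "page_a"),
    ("wet", 3, "page_b"),
])

def sources_for(index, word, rank):
    """Return the sources of one (word, rank) partition, in source order."""
    # The empty string sorts before any real source name, so this finds
    # the first entry of the partition (or where it would have been).
    start = bisect_left(index, (word, rank, ""))
    result = []
    for w, r, source in index[start:]:
        if (w, r) != (word, rank):
            break  # walked past the end of the partition
        result.append(source)
    return result
```

Each partition comes back already sorted by source, which is what makes a merge-style join on source possible.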

In a relational database it would look like this:

SELECT  w1.source
FROM    ranks r1
JOIN    words w1
ON      w1.word = 'wet'
        AND w1.rank = r1.value
CROSS JOIN
        ranks r2
JOIN    words w2
ON      w2.word = 'shirt'
        AND w2.rank = r2.value
        AND w2.source = w1.source
CROSS JOIN
        ranks r3
JOIN    words w3
ON      w3.word = 'contest'
        AND w3.rank = r3.value
        AND w3.source = w1.source
ORDER BY
        relevance_formula (w1.rank, w2.rank, w3.rank)
LIMIT 10

This is best executed using a MERGE JOIN over the three sets partitioned by rank.

However, no optimizer I'm aware of will build this plan (leaving aside the fact that relevance_formula may not distribute over the individual ranks).

To work around this, you should implement your own query plan: start at the top of each word/rank pair and just descend all three sets simultaneously, skipping the missing values and using search rather than next if you feel that there will be too much to skip in one of the sets.
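A hand-written plan of this kind could look roughly like the Python sketch below. It is a simplification under stated assumptions: each word is reduced to a single sorted list of sources (the per-rank partitioning and the relevance formula are left out), and `intersect_sorted` is a hypothetical helper name. It always uses "search" (a binary search forward from the current position) instead of stepping with "next":

```python
from bisect import bisect_left

def intersect_sorted(lists):
    """Yield the values present in every sorted list, walking all lists at once."""
    if not lists or any(not lst for lst in lists):
        return
    positions = [0] * len(lists)
    # No value smaller than the largest head can be in every list.
    candidate = max(lst[0] for lst in lists)
    while True:
        agreed = True
        for i, lst in enumerate(lists):
            # "search" rather than "next": seek forward to the candidate.
            pos = bisect_left(lst, candidate, positions[i])
            if pos == len(lst):
                return  # one list is exhausted, so no further matches exist
            positions[i] = pos
            if lst[pos] != candidate:
                # The candidate is missing here; skip ahead to the next
                # value this list does have and start the round again.
                candidate = lst[pos]
                agreed = False
                break
        if agreed:
            yield candidate
            # Move past the match in the first list to pick the next candidate.
            positions[0] += 1
            if positions[0] == len(lists[0]):
                return
            candidate = lists[0][positions[0]]
```

For example, `list(intersect_sorted([[1, 3, 5, 7, 9], [3, 4, 7, 10], [2, 3, 7, 8]]))` gives `[3, 7]`. A real implementation would probably mix plain stepping with seeking, depending on how much it expects to skip.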

That said, the relational approach gives you a more convenient way to query data, at the cost of a possible performance penalty.

If you are developing a campus web server, then writing those SELECT * queries is OK even if they execute one microsecond longer than they possibly could. But if you're developing a Google, it is worth spending some time on optimizing the queries (which pure relational systems, allowing access to their data only through SQL, just would not let you do).

So-called NoSQL and relational databases sometimes blend into each other. For instance, Berkeley DB is a well-known NoSQL storage engine which was used by MySQL as a storage backend to allow SQL queries. And vice versa, HandlerSocket allows pure key-value queries against a relational InnoDB store with a MySQL database built over it.

Although your post makes valid points, there is a big difference in how data is stored. HandlerSocket is exactly for skipping the SQL layer of the RDBMS when all you want is to get a row by its index. You can use queries in a document-based datastore. Document stores, graph stores, key/value stores: each stores data differently in order to allow a different, more effective way to query the data. Often data is denormalized for performance purposes even in a relational database. — Maxim Krizhanovsky, Jul 21 '11 at 11:33
@Darhazer: in different relational databases the data are stored differently: in `PostgreSQL` there are no clustered tables, while in `InnoDB` there are no unclustered ones. There are many things I missed, of course, but if I tried to cover everything I would hit the `30K post size * 30 answers per post` limit. — Quassnoi, Jul 21 '11 at 11:45
Yeah but this difference is in the physical organisation of the data only, while the question is about data modeling. — Maxim Krizhanovsky, Jul 21 '11 at 11:51
